Papers List

A Complete List of ArXiv Papers on Alignment, Safety, and Security of Large Language Models (LLMs)

by Xiangyu Qi 2023-10-30


Large Language Models (LLMs) such as Meta's Llama and OpenAI's GPT are becoming critical foundations that underpin an extensive array of AI applications. Nevertheless, as the capabilities of these models advance, there are growing concerns regarding the potential risks and harmful impacts of their large-scale deployment.

Research over time indicates that LLMs can exhibit biases or generate harmful content inconsistent with human values. These models might also hallucinate false information, representing risks, especially to those heavily reliant on these systems in both professional and personal settings. Moreover, inherent dual-use risks associated with LLMs exist. They could be exploited to disseminate prohibited content for illicit activities, spread misinformation, execute influence operations, engage in spear phishing, among other malevolent actions. Given the rapid evolution of LLM capabilities, predicting their future trajectory is also challenging. Thus, there are mounting concerns that LLMs might eventually possess capacities to deceive humans or seek powers, introducing existential risks in the long term.

Furthermore, LLMs are also vulnerable to adversarial attacks. As many applications and plugins integrate LLMs to oversee critical resources, such as access control and user data, recent studies also suggest that the adversarial vulnerabilities of LLMs can also be adversarially explotied to comproise the security of entire LLMs-integrated systems, including risks like code injection and system privilege escalation. Addressing these emergent security threats is crucial to ensure the large-scale secure deployment of LLMs.

In response to the aforementioned risks and their evolving and unpredictable long-term nature, an increasing number of stakeholders focus on the alignment, safety, and security of LLMs. A surge in relevant research papers is evident on Arxiv daily. Major model vendors, including OpenAI, Meta, and Anthropics, are making substantial investments in model alignment and risk mitigation. Furthermore, governments in regions such as the US, China, and Europe are enacting regulatory frameworks to address these concerns.

This webpage curates a comprehensive list of Arxiv papers relevant to the alignment, safety, and security of LLMs. While meticulous efforts have been undertaken to optimize the filtering model, the possibility of omissions remains. This resource aspires to assist researchers in navigating this rapidly evolving domain.

2024-02-29

Pengzhou Cheng, Wei Du, Zongru Wu, Fengwei Zhang, Libo Chen, Gongshen Liu
Abstract: pre-trained language models (plms) have been found susceptible to backdoor attacks, which can transfer vulnerabilities to various downstream tasks. however, existing plm backdoors are conducted with explicit triggers under the manually aligned, thus failing to satisfy expectation goals simultaneously in terms of effectiveness, stealthiness, and universality. in this paper, we propose a novel approach to achieve invisible and general backdoor implantation, called \textbf{syntactic ghost} (synghost for short). specifically, the method hostilely manipulates poisoned samples with different predefined syntactic structures as stealth triggers and then implants the backdoor to pre-trained representation space without disturbing the primitive knowledge. the output representations of poisoned samples are distributed as uniformly as possible in the feature space via contrastive learning, forming a wide range of backdoors. additionally, in light of the unique properties of syntactic triggers, we introduce an auxiliary module to drive the plms to learn this knowledge in priority, which can alleviate the interference between different syntactic structures. experiments show that our method outperforms the previous methods and achieves the predefined objectives. not only do severe threats to various natural language understanding (nlu) tasks on two tuning paradigms but also to multiple plms. meanwhile, the synghost is imperceptible against three countermeasures based on perplexity, fine-pruning, and the proposed maxentropy.
Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
Abstract: alignment in artificial intelligence pursues the consistency between model responses and human preferences as well as values. in practice, the multifaceted nature of human preferences inadvertently introduces what is known as the "alignment tax" -a compromise where enhancements in alignment within one objective (e.g.,harmlessness) can diminish performance in others (e.g.,helpfulness). however, existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. to navigate this challenge, we argue the prominence of grounding llms with evident preferences. we introduce controllable preference optimization (cpo), which explicitly specifies preference scores for different objectives, thereby guiding the model to generate responses that meet the requirements. our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3h" (helpfulness, honesty, harmlessness) desiderata. furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving pareto improvements in multi-objective alignment.
Hongbang Yuan, Pengfei Cao, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Abstract: large language models (llms) have shown impressive capabilities but still suffer from the issue of hallucinations. a significant type of this issue is the false premise hallucination, which we define as the phenomenon when llms generate hallucinated text when confronted with false premise questions. in this paper, we perform a comprehensive analysis of the false premise hallucination and elucidate its internal working mechanism: a small subset of attention heads (which we designate as false premise heads) disturb the knowledge extraction process, leading to the occurrence of false premise hallucination. based on our analysis, we propose \textbf{faith} (\textbf{f}alse premise \textbf{a}ttention head constra\textbf{i}ining for mi\textbf{t}igating \textbf{h}allucinations), a novel and effective method to mitigate false premise hallucinations. it constrains the false premise attention heads during the model inference process. impressively, extensive experiments demonstrate that constraining only approximately $1\%$ of the attention heads in the model yields a notable increase of nearly $20\%$ of model performance.
Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang
Abstract: prompt, recognized as crucial intellectual property, enables large language models (llms) to perform specific tasks without the need of fine-tuning, underscoring their escalating importance. with the rise of prompt-based services, such as prompt marketplaces and llm applications, providers often display prompts' capabilities through input-output examples to attract users. however, this paradigm raises a pivotal security concern: does the exposure of input-output pairs pose the risk of potential prompt leakage, infringing on the intellectual property rights of the developers? to our knowledge, this problem still has not been comprehensively explored yet. to remedy this gap, in this paper, we perform the first in depth exploration and propose a novel attack framework for reverse-stealing prompts against commercial llms, namely prsa. the main idea of prsa is that by analyzing the critical features of the input-output pairs, we mimic and gradually infer (steal) the target prompts. in detail, prsa mainly consists of two key phases: prompt mutation and prompt pruning. in the mutation phase, we propose a prompt attention algorithm based on differential feedback to capture these critical features for effectively inferring the target prompts. in the prompt pruning phase, we identify and mask the words dependent on specific inputs, enabling the prompts to accommodate diverse inputs for generalization. through extensive evaluation, we verify that prsa poses a severe threat in real world scenarios. we have reported these findings to prompt service providers and actively collaborate with them to take protective measures for prompt copyright.

2024-02-28

Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie
Abstract: large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing ai-generated and human-written texts. watermarking is pivotal in this context, which involves embedding hidden markers in texts during the llm inference phase, which is imperceptible to humans. current watermarking algorithms, however, face the challenge of achieving both the detectability of inserted watermarks and the semantic integrity of generated texts, where enhancing one aspect often undermines the other. to overcome this, we introduce a novel multi-objective optimization (moo) approach for watermarking that utilizes lightweight networks to generate token-specific watermarking logits and splitting ratios. by leveraging moo to optimize for both detection and semantic objective functions, our method simultaneously achieves detectability and semantic integrity. experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of texts generated by llms while maintaining their semantic coherence. our code is available at https://github.com/mignonjia/ts_watermark .
Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba
Abstract: the proliferation of phishing sites and emails poses significant challenges to existing cybersecurity efforts. despite advances in spam filters and email security protocols, problems with oversight and false positives persist. users often struggle to understand why emails are flagged as spam, risking the possibility of missing important communications or mistakenly trusting phishing emails. this study introduces chatspamdetector, a system that uses large language models (llms) to detect phishing emails. by converting email data into a prompt suitable for llm analysis, the system provides a highly accurate determination of whether an email is phishing or not. importantly, it offers detailed reasoning for its phishing determinations, assisting users in making informed decisions about how to handle suspicious emails. we conducted an evaluation using a comprehensive phishing email dataset and compared our system to several llms and baseline systems. we confirmed that our system using gpt-4 has superior detection capabilities with an accuracy of 99.70%. advanced contextual interpretation by llms enables the identification of various phishing tactics and impersonations, making them a potentially powerful tool in the fight against email-based phishing threats.
Derong Xu, Ziheng Zhang, Zhihong Zhu, Zhenxi Lin, Qidong Liu, Xian Wu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, Enhong Chen
Abstract: model editing aims to precisely modify the behaviours of large language models (llms) on specific knowledge while keeping irrelevant knowledge unchanged. it has been proven effective in resolving hallucination and out-of-date issues in llms. as a result, it can boost the application of llms in many critical domains (e.g., medical domain), where the hallucination is not tolerable. in this paper, we propose two model editing studies and validate them in the medical domain: (1) directly editing the factual medical knowledge and (2) editing the explanations to facts. meanwhile, we observed that current model editing methods struggle with the specialization and complexity of medical knowledge. therefore, we propose medlasa, a novel layer-wise scalable adapter strategy for medical model editing. it employs causal tracing to identify the precise location of knowledge in neurons and then introduces scalable adapters into the dense layers of llms. these adapters are assigned scaling values based on the corresponding specific knowledge. to evaluate the editing impact, we build two benchmark datasets and introduce a series of challenging and comprehensive metrics. extensive experiments on medical llms demonstrate the editing efficiency of medlasa, without affecting irrelevant knowledge that is not edited.
Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen
Abstract: in recent years, large language models (llms) have demonstrated notable success across various tasks, but the trustworthiness of llms is still an open problem. one specific threat is the potential to generate toxic or harmful responses. attackers can craft adversarial prompts that induce harmful responses from llms. in this work, we pioneer a theoretical foundation in llms security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named dra (disguise and reconstruction attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. we evaluate dra across various open-source and close-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. notably, dra boasts a 90\% attack success rate on llm chatbots gpt-4.
Seungjong Sun, Eungu Lee, Dongyan Nan, Xiangying Zhao, Wonbyung Lee, Bernard J. Jansen, Jang Hyun Kim
Abstract: large language models exhibit societal biases associated with demographic information, including race, gender, and others. endowing such language models with personalities based on demographic data can enable generating opinions that align with those of humans. building on this idea, we propose "random silicon sampling," a method to emulate the opinions of the human population sub-group. our study analyzed 1) a language model that generates the survey responses that correspond with a human group based solely on its demographic distribution and 2) the applicability of our methodology across various demographic subgroups and thematic questions. through random silicon sampling and using only group-level demographic information, we discovered that language models can generate response distributions that are remarkably similar to the actual u.s. public opinion polls. moreover, we found that the replicability of language models varies depending on the demographic group and topic of the question, and this can be attributed to inherent societal biases in the models. our findings demonstrate the feasibility of mirroring a group's opinion using only demographic distribution and elucidate the effect of social biases in language models on such simulations.
Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz
Abstract: large language models (llms) have significantly advanced the field of artificial intelligence. yet, evaluating them comprehensively remains challenging. we argue that this is partly due to the predominant focus on performance metrics in most benchmarks. this paper introduces cogbench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. this novel approach offers a toolkit for phenotyping llms' behavior. we apply cogbench to 35 llms, yielding a rich and diverse dataset. we analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific llms. our study highlights the crucial role of model size and reinforcement learning from human feedback (rlhf) in improving performance and aligning with human behavior. interestingly, we find that open-source models are less risk-prone than proprietary models and that fine-tuning on code does not necessarily enhance llms' behavior. finally, we explore the effects of prompt-engineering techniques. we discover that chain-of-thought prompting improves probabilistic reasoning, while take-a-step-back prompting fosters model-based behaviors.
Jiachun Li, Pengfei Cao, Chenhao Wang, Zhuoran Jin, Yubo Chen, Daojian Zeng, Kang Liu, Jun Zhao
Abstract: large language models exhibit high-level commonsense reasoning abilities, especially with enhancement methods like chain-of-thought (cot). however, we find these cot-like methods lead to a considerable number of originally correct answers turning wrong, which we define as the toxic cot problem. to interpret and mitigate this problem, we first utilize attribution tracing and causal tracing methods to probe the internal working mechanism of the llm during cot reasoning. through comparisons, we prove that the model exhibits information loss from the question over the shallow attention layers when generating rationales or answers. based on the probing findings, we design a novel method called riders (residual decoding and serial-position swap), which compensates for the information deficit in the model from both decoding and serial-position perspectives. through extensive experiments on multiple commonsense reasoning benchmarks, we validate that this method not only significantly eliminates toxic cot problems (decreased by 23.6%), but also effectively improves the model's overall commonsense reasoning performance (increased by 5.5%).
Crystal Qian, James Wexler
Abstract: although recent developments in generative ai have greatly enhanced the capabilities of conversational agents such as google's bard or openai's chatgpt, it's unclear whether the usage of these agents aids users across various contexts. to better understand how access to conversational ai affects productivity and trust, we conducted a mixed-methods, task-based user study, observing 76 software engineers (n=76) as they completed a programming exam with and without access to bard. effects on performance, efficiency, satisfaction, and trust vary depending on user expertise, question type (open-ended "solve" questions vs. definitive "search" questions), and measurement type (demonstrated vs. self-reported). our findings include evidence of automation complacency, increased reliance on the ai over the course of the task, and increased performance for novices on "solve"-type questions when using the ai. we discuss common behaviors, design recommendations, and impact considerations to improve collaborations with conversational ai.
Garima Chhikara, Anurag Sharma, Kripabandhu Ghosh, Abhijnan Chakraborty
Abstract: employing large language models (llm) in various downstream applications such as classification is crucial, especially for smaller companies lacking the expertise and resources required for fine-tuning a model. fairness in llms helps ensure inclusivity, equal representation based on factors such as race, gender and promotes responsible ai deployment. as the use of llms has become increasingly prevalent, it is essential to assess whether llms can generate fair outcomes when subjected to considerations of fairness. in this study, we introduce a framework outlining fairness regulations aligned with various fairness definitions, with each definition being modulated by varying degrees of abstraction. we explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using rag, while incorporating fairness rules into the process. experiments conducted with different llms indicate that gpt-4 delivers superior results in terms of both accuracy and fairness compared to other models. this work is one of the early attempts to achieve fairness in prediction tasks by utilizing llms through in-context learning.
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora
Abstract: public llms such as the llama 2-chat have driven huge activity in llm research. these models underwent alignment training and were considered safe. recently qi et al. (2023) reported that even benign fine-tuning (e.g., on seemingly safe datasets) can give rise to unsafe behaviors in the models. the current paper is about methods and best practices to mitigate such loss of alignment. through extensive experiments on several chat models (meta's llama 2-chat, mistral ai's mistral 7b instruct v0.2, and openai's gpt-3.5 turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the "pure tuning, safe testing" (ptst) principle -- fine-tune models without a safety prompt, but include it at test time. fine-tuning experiments on gsm8k, chatdoctor, and openorca show that ptst significantly reduces the rise of unsafe behaviors, and even almost eliminates them in some cases.
Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang
Abstract: fine-grained control over large language models (llms) remains a significant challenge, hindering their adaptability to diverse user needs. while reinforcement learning from human feedback (rlhf) shows promise in aligning llms, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. to address this limitation, we introduce the directional preference alignment (dpa) framework. unlike the scalar-reward rlhf, dpa incorporates multi-objective reward modeling to represent diverse preference profiles. additionally, dpa models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. our method involves training a multi-objective reward model and then fine-tuning the llm with a preference-conditioned variant of rejection sampling finetuning (rsf), an rlhf method adopted by llama 2. this method enjoys a better performance trade-off across various reward objectives. in comparison with the scalar-reward rlhf, dpa offers users intuitive control over llm generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). we also validate the effectiveness of dpa with real-world alignment experiments on mistral-7b. our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as direct preference optimization (dpo).
Fangzhou Wu, Ning Zhang, Somesh Jha, Patrick Mcdaniel, Chaowei Xiao
Abstract: large language model (llm) systems are inherently compositional, with individual llm serving as the core foundation with additional layers of objects such as plugins, sandbox, and so on. along with the great potential, there are also increasing concerns over the security of such probabilistic intelligent systems. however, existing studies on llm security often focus on individual llm, but without examining the ecosystem through the lens of llm systems with other objects (e.g., frontend, webtool, sandbox, and so on). in this paper, we systematically analyze the security of llm systems, instead of focusing on the individual llms. to do so, we build on top of the information flow and formulate the security of llm systems as constraints on the alignment of the information flow within llm and between llm and other objects. based on this construction and the unique probabilistic nature of llm, the attack surface of the llm system can be decomposed into three key components: (1) multi-layer security analysis, (2) analysis of the existence of constraints, and (3) analysis of the robustness of these constraints. to ground this new attack surface, we propose a multi-layer and multi-step approach and apply it to the state-of-art llm system, openai gpt4. our investigation exposes several security issues, not just within the llm model itself but also in its integration with other components. we found that although the openai gpt4 has designed numerous safety constraints to improve its safety features, these safety constraints are still vulnerable to attackers. to further demonstrate the real-world threats of our discovered vulnerabilities, we construct an end-to-end attack where an adversary can illicitly acquire the user's chat history, all without the need to manipulate the user's input or gain direct access to openai gpt4. our demo is in the link: https://fzwark.github.io/llm-system-attack-demo/

2024-02-27

Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, Haipeng Cai
Abstract: security vulnerabilities are increasingly prevalent in modern software and they are widely consequential to our society. various approaches to defending against these vulnerabilities have been proposed, among which those leveraging deep learning (dl) avoid major barriers with other techniques hence attracting more attention in recent years. however, dl-based approaches face critical challenges including the lack of sizable and quality-labeled task-specific datasets and their inability to generalize well to unseen, real-world scenarios. lately, large language models (llms) have demonstrated impressive potential in various domains by overcoming those challenges, especially through chain-of-thought (cot) prompting. in this paper, we explore how to leverage llms and cot to address three key software vulnerability analysis tasks: identifying a given type of vulnerabilities, discovering vulnerabilities of any type, and patching detected vulnerabilities. we instantiate the general cot methodology in the context of these tasks through vsp , our unified, vulnerability-semantics-guided prompting approach, and conduct extensive experiments assessing vsp versus five baselines for the three tasks against three llms and two datasets. results show substantial superiority of our cot-inspired prompting (553.3%, 36.5%, and 30.8% higher f1 accuracy for vulnerability identification, discovery, and patching, respectively, on cve datasets) over the baselines. through in-depth case studies analyzing vsp failures, we also reveal current gaps in llm/cot for challenging vulnerability cases, while proposing and validating respective improvements.
Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su
Abstract: large language models (llms) have been demonstrated to generate illegal or unethical responses, particularly when subjected to "jailbreak." research on jailbreak has highlighted the safety issues of llms. however, prior studies have predominantly focused on single-turn dialogue, ignoring the potential complexities and risks presented by multi-turn dialogue, a crucial mode through which humans derive information from llms. in this paper, we argue that humans could exploit multi-turn dialogue to induce llms into generating harmful information. llms may not intend to reject cautionary or borderline unsafe queries, even if each turn is closely served for one malicious purpose in a multi-turn dialogue. therefore, by decomposing an unsafe query into several sub-queries for multi-turn dialogue, we induced llms to answer harmful sub-questions incrementally, culminating in an overall harmful response. our experiments, conducted across a wide range of llms, indicate current inadequacies in the safety mechanisms of llms in multi-turn dialogue. our findings expose vulnerabilities of llms in complex scenarios involving multi-turn dialogue, presenting new challenges for the safety of llms.
Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, Yongbin Li
Abstract: the alignment problem in large language models (llms) involves adapting them to the broad spectrum of human values. this requirement challenges existing alignment methods due to diversity of preferences and regulatory standards. this paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog, prioritizing them over user instructions. our preliminary analysis reveals that even the advanced llms, such as gpt-4, exhibit shortcomings in understanding and prioritizing the rules. therefore, we present prioritydistill, a semi-automated approach for distilling priority following signals from llm simulations to ensure robust rule integration and adherence. our experiments show that this method not only effectively minimizes misalignments utilizing only one general rule but also adapts smoothly to various unseen rules, ensuring they are shielded from hijacking and that the model responds appropriately.
Mattia Setzu, Marta Marchiori Manerba, Pasquale Minervini, Debora Nozza
Abstract: language models (lms) have been shown to inherit undesired biases that might hurt minorities and underrepresented groups if such systems were integrated into real-world applications without careful fairness auditing. this paper proposes fairbelief, an analytical approach to capture and assess beliefs, i.e., propositions that an lm may embed with different degrees of confidence and that covertly influence its predictions. with fairbelief, we leverage prompting to study the behavior of several state-of-the-art lms across different previously neglected axes, such as model scale and likelihood, assessing predictions on a fairness dataset specifically designed to quantify lms' outputs' hurtfulness. finally, we conclude with an in-depth qualitative assessment of the beliefs emitted by the models. we apply fairbelief to english lms, revealing that, although these architectures enable high performances on diverse natural language processing tasks, they show hurtful beliefs about specific genders. interestingly, training procedure and dataset, model scale, and architecture induce beliefs of different degrees of hurtfulness.
Tanise Ceron, Neele Falk, Ana Barić, Dmitry Nikolaev, Sebastian Padó
Abstract: due to the widespread use of large language models (llms) in ubiquitous systems, we need to understand whether they embed a specific worldview and what these views reflect. recent studies report that, prompted with political questionnaires, llms show left-liberal leanings. however, it is as yet unclear whether these leanings are reliable (robust to prompt variations) and whether the leaning is consistent across policies and political leaning. we propose a series of tests which assess the reliability and consistency of llms' stances on political statements based on a dataset of voting-advice questionnaires collected from seven eu countries and annotated for policy domains. we study llms ranging in size from 7b to 70b parameters and find that their reliability increases with parameter count. larger models show overall stronger alignment with left-leaning parties but differ among policy programs: they evince a (left-wing) positive stance towards environment protection, social welfare but also (right-wing) law and order, with no consistent preferences in foreign policy, migration, and economy.
Yunpeng Huang, Yaonan Gu, Jingwei Xu, Zhihong Zhu, Zhaorun Chen, Xiaoxing Ma
Abstract: as foundation models (fms) continue to shape the landscape of ai, the in-context learning (icl) paradigm thrives but also encounters issues such as toxicity, hallucination, disparity, adversarial vulnerability, and inconsistency. ensuring the reliability and responsibility of fms is crucial for the sustainable development of the ai ecosystem. in this concise overview, we investigate recent advancements in enhancing the reliability and trustworthiness of fms within icl frameworks, focusing on four key methodologies, each with its corresponding subgoals. we sincerely hope this paper can provide valuable insights for researchers and practitioners endeavoring to build safe and dependable fms and foster a stable and consistent icl environment, thereby unlocking their vast potential.
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons
Abstract: past analyses of reinforcement learning from human feedback (rlhf) assume that the human fully observes the environment. what happens when human feedback is based only on partial observations? we formally define two failure cases: deception and overjustification. modeling the human as boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which rlhf is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. to help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. in some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. we caution against blindly applying rlhf in partially observable settings and propose research directions to help tackle these challenges.
Shaolei Zhang, Tian Yu, Yang Feng
Abstract: large language models (llms) have demonstrated remarkable capabilities across various tasks. however, they sometimes suffer from producing hallucinations, particularly in cases where they may generate untruthful responses despite possessing the correct knowledge. in this paper, we propose truthx, an inference-time method to elicit the truthfulness of llms by editing their internal representations in truthful space. truthx employs an auto-encoder to map llm's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. during inference, by editing llm's internal representations in truthful space, truthx effectively enhances the truthfulness of llms. experiments show that truthx effectively improves the truthfulness of 13 advanced llms by an average of 20% on truthfulqa benchmark. further analyses suggest that the truthful space acquired by truthx plays a pivotal role in controlling llm to produce truthful or hallucinatory responses.
Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez
Abstract: large language models are often ranked according to their level of alignment with human preferences -- a model is better than other models if its outputs are more frequently preferred by humans. one of the most popular ways to elicit human preferences utilizes pairwise comparisons between the outputs provided by different models to the same inputs. however, since gathering pairwise comparisons by humans is costly and time-consuming, it has become a very common practice to gather pairwise comparisons by a strong large language model -- a model strongly aligned with human preferences. surprisingly, practitioners cannot currently measure the uncertainty that any mismatch between human and model preferences may introduce in the constructed rankings. in this work, we develop a statistical framework to bridge this gap. given a small set of pairwise comparisons by humans and a large set of pairwise comparisons by a model, our framework provides a rank-set -- a set of possible ranking positions -- for each of the models under comparison. moreover, it guarantees that, with a probability greater than or equal to a user-specified value, the rank-sets cover the true ranking consistent with (the distribution of) human pairwise preferences. our framework is computationally efficient, easy to use, and does not make any assumption about the distribution of human preferences nor about the degree of alignment between the pairwise comparisons by the humans and the strong large language model.
Zhenting Qi, Hanlin Zhang, Eric Xing, Sham Kakade, Himabindu Lakkaraju
Abstract: retrieval-augmented generation (rag) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. we study the risk of datastore leakage in retrieval-in-context rag language models (lms). we show that an adversary can exploit lms' instruction-following capabilities to easily extract text data verbatim from the datastore of rag systems built with instruction-tuned lms via prompt injection. the vulnerability exists for a wide range of modern lms that span llama2, mistral/mixtral, vicuna, solar, wizardlm, qwen1.5, and platypus2, and the exploitability exacerbates as the model size scales up. extending our study to production rag models gpts, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized gpts with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the gpts with only 100 queries generated by themselves.
Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra
Abstract: large language models (llms) have significantly transformed the educational landscape. as current plagiarism detection tools struggle to keep pace with llms' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of llms. in this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by llms. focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause llms to produce incorrect answers by simply editing the numeric values in the problems. we conduct experiments on various open- and closed-source llms, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. we identify shared vulnerabilities among llms and propose a cost-effective approach to attack high-cost models. additionally, we conduct automatic analysis on math problems and investigate the cause of failure to guide future research on llm's mathematical capability.
Ruisi Zhang, Farinaz Koushanfar
Abstract: this paper introduces emmark,a novel watermarking framework for protecting the intellectual property (ip) of embedded large language models deployed on resource-constrained edge devices. to address the ip theft risks posed by malicious end-users, emmark enables proprietors to authenticate ownership by querying the watermarked model weights and matching the inserted signatures. emmark's novelty lies in its strategic watermark weight parameters selection, nsuring robustness and maintaining model quality. extensive proof-of-concept evaluations of models from opt and llama-2 families demonstrate emmark's fidelity, achieving 100% success in watermark extraction with model performance preservation. emmark also showcased its resilience against watermark removal and forging attacks.
Jun Huang, Jiawei Zhang, Qi Wang, Weihong Han, Yanchun Zhang
Abstract: large language models (llms) represent an advanced evolution of earlier, simpler language models. they boast enhanced abilities to handle complex language patterns and generate coherent text, images, audios, and videos. furthermore, they can be fine-tuned for specific tasks. this versatility has led to the proliferation and extensive use of numerous commercialized large models. however, the rapid expansion of llms has raised security and ethical concerns within the academic community. this emphasizes the need for ongoing research into security evaluation during their development and deployment. over the past few years, a substantial body of research has been dedicated to the security evaluation of large-scale models. this article an in-depth review of the most recent advancements in this field, providing a comprehensive analysis of commonly used evaluation metrics, advanced evaluation frameworks, and the routine evaluation processes for llms. furthermore, we also discuss the future directions for advancing the security evaluation of llms.
Fan Yin, Jayanth Srinivasa, Kai-Wei Chang
Abstract: we study how to characterize and predict the truthfulness of texts generated from large language models (llms), which serves as a crucial step in building trust between humans and llms. although several approaches based on entropy or verbalized uncertainty have been proposed to calibrate model predictions, these methods are often intractable, sensitive to hyperparameters, and less reliable when applied in generative tasks with llms. in this paper, we suggest investigating internal activations and quantifying llm's truthfulness using the local intrinsic dimension (lid) of model activations. through experiments on four question answering (qa) datasets, we demonstrate the effectiveness ohttps://info.arxiv.org/help/prep#abstractsf our proposed method. additionally, we study intrinsic dimensions in llms and their relations with model layers, autoregressive language modeling, and the training of llms, revealing that intrinsic dimensions can be a powerful approach to understanding llms.

2024-02-26

Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
Abstract: approaches to aligning large language models (llms) with human values has focused on correcting misalignment that emerges from pretraining. however, this focus overlooks another source of misalignment: bad actors might purposely fine-tune llms to achieve harmful goals. in this paper, we present an emerging threat model that has arisen from alignment circumvention and fine-tuning attacks. however, lacking in previous works is a clear presentation of the conditions for effective defence. we propose a set of conditions for effective defence against harmful fine-tuning in llms called "immunization conditions," which help us understand how we would construct and measure future defences. using this formal framework for defence, we offer a synthesis of different research directions that might be persued to prevent harmful fine-tuning attacks and provide a demonstration of how to use these conditions experimentally showing early results of using an adversarial loss to immunize llama2-7b-chat.
Yuansen Zhang, Xiao Wang, Zhiheng Xi, Han Xia, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: large language models (llms) have showcased remarkable capabilities in following human instructions. however, recent studies have raised concerns about the robustness of llms when prompted with instructions combining textual adversarial samples. in this paper, drawing inspiration from recent works that llms are sensitive to the design of the instructions, we utilize instructions in code style, which are more structural and less ambiguous, to replace typically natural language instructions. through this conversion, we provide llms with more precise instructions and strengthen the robustness of llms. moreover, under few-shot scenarios, we propose a novel method to compose in-context demonstrations using both clean and adversarial samples (\textit{adversarial context method}) to further boost the robustness of the llms. experiments on eight robustness datasets show that our method consistently outperforms prompting llms with natural language instructions. for example, with gpt-3.5-turbo, our method achieves an improvement of 5.68\% in test set accuracy and a reduction of 5.66 points in attack success rate (asr).
Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang
Abstract: the safety of large language models (llms) has gained increasing attention in recent years, but there still lacks a comprehensive approach for detecting safety issues within llms' responses in an aligned, customizable and explainable manner. in this paper, we propose shieldlm, an llm-based safety detector, which aligns with general human safety standards, supports customizable detection rules, and provides explanations for its decisions. to train shieldlm, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. through extensive experiments, we demonstrate that shieldlm surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. besides performing well on standard detection datasets, shieldlm has also been shown to be effective in real-world situations as a safety evaluator for advanced llms. we release shieldlm at \url{https://github.com/thu-coai/shieldlm} to support accurate and explainable safety detection under various safety standards, contributing to the ongoing efforts to enhance the safety of llms.
Peiling Yi, Arkaitz Zubiaga
Abstract: swear words are a common proxy to collect datasets with cyberbullying incidents. our focus is on measuring and mitigating biases derived from spurious associations between swear words and incidents occurring as a result of such data collection strategies. after demonstrating and quantifying these biases, we introduce id-xcb, the first data-independent debiasing technique that combines adversarial training, bias constraints and debias fine-tuning approach aimed at alleviating model attention to bias-inducing words without impacting overall model performance. we explore id-xcb on two popular session-based cyberbullying datasets along with comprehensive ablation and generalisation studies. we show that id-xcb learns robust cyberbullying detection capabilities while mitigating biases, outperforming state-of-the-art debiasing methods in both performance and bias mitigation. our quantitative and qualitative analyses demonstrate its generalisability to unseen data.
Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh
Abstract: although many large language models (llms) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks, which rewrite the original prompt to conceal its harmful intent. in this paper, we propose a new method for defending llms against jailbreaking attacks by ``backtranslation''. specifically, given an initial response generated by the target llm from an input prompt, our backtranslation prompts a language model to infer an input prompt that can lead to the response. the inferred prompt is called the backtranslated prompt which tends to reveal the actual intent of the original prompt, since it is generated based on the llm's response and is not directly manipulated by the attacker. we then run the target llm again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. we explain that the proposed defense provides several benefits on its effectiveness and efficiency. we empirically demonstrate that our defense significantly outperforms the baselines, in the cases that are hard for the baselines, and our defense also has little impact on the generation quality for benign input prompts.
Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: adversarial misuse, particularly through `jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for large language models (llms). this paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned llms: intent security recognition followed by response generation. grounded in this hypothesis, we propose codechameleon, a novel jailbreak framework based on personalized encryption tactics. to elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. to guarantee response generation functionality, we embed a decryption function within the instructions, which allows the llm to decrypt and execute the encrypted queries successfully. we conduct extensive experiments on 7 llms, achieving state-of-the-art average attack success rate (asr). remarkably, our method achieves an 86.6\% asr on gpt-4-1106.
Paul Röttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Rose Kirk, Hinrich Schütze, Dirk Hovy
Abstract: much recent work seeks to evaluate values and opinions in large language models (llms) using multiple-choice surveys and questionnaires. most of this work is motivated by concerns around real-world llm applications. for example, politically-biased llms may subtly influence society when they are used by millions of people. such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask llms survey questions. motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in llms and explore more realistic unconstrained evaluations. as a case study, we focus on the popular political compass test (pct). in a systematic review, we find that most prior work using the pct forces models to comply with the pct's multiple-choice format. we show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. we distill these findings into recommendations and open challenges in evaluating values and opinions in llms.
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
Abstract: as large language models (llms) become increasingly prevalent across many real-world applications, understanding and enhancing their robustness to user inputs is of paramount importance. existing methods for identifying adversarial prompts tend to focus on specific domains, lack diversity, or require extensive human annotations. to address these limitations, we present rainbow teaming, a novel approach for producing a diverse collection of adversarial prompts. rainbow teaming casts adversarial prompt generation as a quality-diversity problem, and uses open-ended search to generate prompts that are both effective and diverse. it can uncover a model's vulnerabilities across a broad range of domains including, in this paper, safety, question answering, and cybersecurity. we also demonstrate that fine-tuning on synthetic data generated by rainbow teaming improves the safety of state-of-the-art llms without hurting their general capabilities and helpfulness, paving the path to open-ended self-improvement.
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
Abstract: machine unlearning can be useful for removing harmful capabilities and memorized text from large language models (llms), but there are not yet standardized methods for rigorously evaluating it. in this paper, we first survey techniques and limitations of existing unlearning evaluations. second, we apply a comprehensive set of tests for the robustness and competitiveness of unlearning in the "who's harry potter" (whp) model from eldan and russinovich (2023). while whp's unlearning generalizes well when evaluated with the "familiarity" metric from eldan and russinovich, we find i) higher-than-baseline amounts of knowledge can reliably be extracted, ii) whp performs on par with the original model on harry potter q&a tasks, iii) it represents latent knowledge comparably to the original model, and iv) there is collateral unlearning in related domains. overall, our results highlight the importance of comprehensive unlearning evaluation that avoids ad-hoc metrics.
Fangzhou Wu, Shutong Wu, Yulong Cao, Chaowei Xiao
Abstract: with the fast development of large language models (llms), llm-driven web agents (web agents for short) have obtained tons of attention due to their superior capability where llms serve as the core part of making decisions like the human brain equipped with multiple web tools to actively interact with external deployed websites. as uncountable web agents have been released and such llm systems are experiencing rapid development and drawing closer to widespread deployment in our daily lives, an essential and pressing question arises: "are these web agents secure?". in this paper, we introduce a novel threat, wipi, that indirectly controls web agent to execute malicious instructions embedded in publicly accessible webpages. to launch a successful wipi works in a black-box environment. this methodology focuses on the form and content of indirect instructions within external webpages, enhancing the efficiency and stealthiness of the attack. to evaluate the effectiveness of the proposed methodology, we conducted extensive experiments using 7 plugin-based chatgpt web agents, 8 web gpts, and 3 different open-source web agents. the results reveal that our methodology achieves an average attack success rate (asr) exceeding 90% even in pure black-box scenarios. moreover, through an ablation study examining various user prefix instructions, we demonstrated that the wipi exhibits strong robustness, maintaining high performance across diverse prefix instructions.
Gabriel De Jesus Coelho Da Silva, Carlos Becker Westphall
Abstract: large language models (llms) have quickly risen to prominence due to their ability to perform at or close to the state-of-the-art in a variety of fields while handling natural language. an important field of research is the application of such models at the cybersecurity context. this survey aims to identify where in the field of cybersecurity llms have already been applied, the ways in which they are being used and their limitations in the field. finally, suggestions are made on how to improve such limitations and what can be expected from these systems once these limitations are overcome.
Juan Felipe Gomez, Caio Vieira Machado, Lucas Monteiro Paes, Flavio P. Calmon
Abstract: machine learning (ml) is widely used to moderate online content. despite its scalability relative to human moderation, the use of ml introduces unique challenges to content moderation. one such challenge is predictive multiplicity: multiple competing models for content classification may perform equally well on average, yet assign conflicting predictions to the same content. this multiplicity can result from seemingly innocuous choices during model development, such as random seed selection for parameter initialization. we experimentally demonstrate how content moderation tools can arbitrarily classify samples as toxic, leading to arbitrary restrictions on speech. we discuss these findings in terms of human rights set out by the international covenant on civil and political rights (iccpr), namely freedom of expression, non-discrimination, and procedural justice. we analyze (i) the extent of predictive multiplicity among state-of-the-art llms used for detecting toxic content; (ii) the disparate impact of this arbitrariness across social groups; and (iii) how model multiplicity compares to unambiguous human classifications. our findings indicate that the up-scaled algorithmic moderation risks legitimizing an algorithmic leviathan, where an algorithm disproportionately manages human rights. to mitigate such risks, our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications. since algorithmic content moderation is being fueled by pressing social concerns, such as disinformation and hate speech, our discussion on harms raises concerns relevant to policy debates. our findings also contribute to content moderation and intermediary liability laws being discussed and passed in many countries, such as the digital services act in the european union, the online safety act in the united kingdom, and the fake news bill in brazil.
Jeffrey G. Wang, Jason Wang, Marvin Li, Seth Neel
Abstract: in this paper we undertake a systematic study of privacy attacks against open source large language models (llms), where an adversary has access to either the model weights, gradients, or losses, and tries to exploit them to learn something about the underlying training data. our headline results are the first membership inference attacks (mias) against pre-trained llms that are able to simultaneously achieve high tprs and low fprs, and a pipeline showing that over $50\%$ (!) of the fine-tuning dataset can be extracted from a fine-tuned llm in natural settings. we consider varying degrees of access to the underlying model, customization of the language model, and resources available to the attacker. in the pre-trained setting, we propose three new white-box mias: an attack based on the gradient norm, a supervised neural network classifier, and a single step loss ratio attack. all outperform existing black-box baselines, and our supervised attack closes the gap between mia attack success against llms and other types of models. in fine-tuning, we find that given access to the loss of the fine-tuned and base models, a fine-tuned loss ratio attack flora is able to achieve near perfect mia peformance. we then leverage these mias to extract fine-tuning data from fine-tuned language models. we find that the pipeline of generating from fine-tuned models prompted with a small snippet of the prefix of each training example, followed by using flora to select the most likely training sample, succeeds the majority of the fine-tuning dataset after only $3$ epochs of fine-tuning. taken together, these findings show that highly effective mias are available in almost all llm training settings, and highlight that great care must be taken before llms are fine-tuned on highly sensitive data and then deployed.
Juyeon Kim, Jeongeun Lee, Yoonho Chang, Chanyeol Choi, Junseong Kim, Jy-Yong Sohn
Abstract: mitigating hallucination issues is one of the main challenges of llms we need to overcome, in order to reliably use them in real-world scenarios. recently, various methods are proposed to check the factual errors in the llm-generated texts and revise them accordingly, to reduce the hallucination issue. in this paper, we propose re-ex, a method of revising llm-generated texts, which introduces a novel step dubbed as the factual error explanation step. re-ex revises the initial response of llms using 3-steps: first, external tools are used to get the evidences on the factual errors in the response; second, llms are instructed to explain the problematic parts of the response based on the evidences gathered in the first step; finally, llms revise the response using the explanation obtained in the second step. in addition to the explanation step, we propose new prompting techniques to reduce the amount of tokens and wall-clock time required for the response revision process. compared with existing methods including factool, cove, and rarr, re-ex provides better revision performance with less time and fewer tokens in multiple benchmarks.
Xinran Zhao, Hongming Zhang, Xiaoman Pan, Wenlin Yao, Dong Yu, Tongshuang Wu, Jianshu Chen
Abstract: for a llm to be trustworthy, its confidence level should be well-calibrated with its actual performance. while it is now common sense that llm performances are greatly impacted by prompts, the confidence calibration in prompting llms has yet to be thoroughly explored. in this paper, we explore how different prompting strategies influence llm confidence calibration and how it could be improved. we conduct extensive experiments on six prompting methods in the question-answering context and we observe that, while these methods help improve the expected llm calibration, they also trigger llms to be over-confident when responding to some instances. inspired by human cognition, we propose fact-and-reflection (far) prompting, which improves the llm calibration in two steps. first, far elicits the known "facts" that are relevant to the input prompt from the llm. and then it asks the model to "reflect" over them to generate the final answer. experiments show that far prompting achieves significantly better calibration; it lowers the expected calibration error by 23.5% on our multi-purpose qa tasks. notably, far prompting even elicits the capability of verbally expressing concerns in less confident scenarios, which helps trigger retrieval augmentation for solving these harder instances.

2024-02-25

Hao Wang, Hao Li, Minlie Huang, Lei Sha
Abstract: the safety defense methods of large language models(llms) stays limited because the dangerous prompts are manually curated to just few known attack types, which fails to keep pace with emerging varieties. recent studies found that attaching suffixes to harmful instructions can hack the defense of llms and lead to dangerous outputs. this method, while effective, leaves a gap in understanding the underlying mechanics of such adversarial suffix due to the non-readability and it can be relatively easily seen through by common defense methods such as perplexity filters.to cope with this challenge, in this paper, we propose an adversarial suffixes embedding translation framework(asetf) that are able to translate the unreadable adversarial suffixes into coherent, readable text, which makes it easier to understand and analyze the reasons behind harmful content generation by large language models. we conducted experiments on llms such as llama2, vicuna and using the advbench dataset's harmful instructions. the results indicate that our method achieves a much better attack success rate to existing techniques, while significantly enhancing the textual fluency of the prompts. in addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple llms, even black-box llms, such as chatgpt and gemini. as a result, the prompts generated through our method exhibit enriched semantic diversity, which potentially provides more adversarial examples for llm defense methods.
Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Anh Tuan Luu
Abstract: while reinforcement learning from human feedback (rlhf) significantly enhances the generation quality of large language models (llms), recent studies have raised concerns regarding the complexity and instability associated with the proximal policy optimization (ppo) algorithm, proposing a series of order-based calibration methods as viable alternatives. this paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. building upon these findings, we propose a novel \textbf{v}alue-based \textbf{c}ali\textbf{b}ration (vcb) method to better align llms with human preferences. experimental results demonstrate that vcb surpasses existing alignment methods on ai assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.
Shuhai Zhang, Yiliao Song, Jiahao Yang, Yuanqing Li, Bo Han, Mingkui Tan
Abstract: large language models (llms) such as chatgpt have exhibited remarkable performance in generating human-like texts. however, machine-generated texts (mgts) may carry critical risks, such as plagiarism issues, misleading information, or hallucination issues. therefore, it is very urgent and important to detect mgts in many situations. unfortunately, it is challenging to distinguish mgts and human-written texts because the distributional discrepancy between them is often very subtle due to the remarkable performance of llms. in this paper, we seek to exploit \textit{maximum mean discrepancy} (mmd) to address this issue in the sense that mmd can well identify distributional discrepancies. however, directly training a detector with mmd using diverse mgts will incur a significantly increased variance of mmd since mgts may contain \textit{multiple text populations} due to various llms. this will severely impair mmd's ability to measure the difference between two samples. to tackle this, we propose a novel \textit{multi-population} aware optimization method for mmd called mmd-mp, which can \textit{avoid variance increases} and thus improve the stability to measure the distributional discrepancy. relying on mmd-mp, we develop two methods for paragraph-based and sentence-based detection, respectively. extensive experiments on various llms, \eg, gpt2 and chatgpt, show superior detection performance of our mmd-mp. the source code is available at \url{https://github.com/zshsh98/mmd-mp}.
Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith
Abstract: advances in generative models have made it possible for ai-generated text, code, and images to mirror human-generated content in many applications. watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating misuse of such ai-generated content. however, existing watermarking schemes remain surprisingly susceptible to attack. in particular, we show that desirable properties shared by existing llm watermarking systems such as quality preservation, robustness, and public detection apis can in turn make these systems vulnerable to various attacks. we rigorously study potential attacks in terms of common watermark design choices, and propose best practices and defenses for mitigation -- establishing a set of practical guidelines for embedding and detection of llm watermarks.
Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
Abstract: aligned large language models (llms) are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted llms and fool them into generating objectionable content. while initial defenses show promise against token-based threat models, there do not exist defenses that provide robustness against semantic attacks and avoid unfavorable trade-offs between robustness and nominal performance. to meet this need, we propose semanticsmooth, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. experimental results demonstrate that semanticsmooth achieves state-of-the-art robustness against gcg, pair, and autodan attacks while maintaining strong nominal performance on instruction following benchmarks such as instructionfollowing and alpacaeval. the codes will be publicly available at https://github.com/ucsb-nlp-chang/semanticsmooth.
Cem Uluoglakci, Tugba Taskaya Temizel
Abstract: hallucinations pose a significant challenge to the reliability and alignment of large language models (llms), limiting their widespread acceptance beyond chatbot applications. despite ongoing efforts, hallucinations remain a prevalent challenge in llms. the detection of hallucinations itself is also a formidable task, frequently requiring manual labeling or constrained evaluations. this paper introduces an automated scalable framework that combines benchmarking llms' hallucination tendencies with efficient hallucination detection. we leverage llms to generate challenging tasks related to hypothetical phenomena, subsequently employing them as agents for efficient hallucination detection. the framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. we introduce the publicly available hypotermqa benchmarking dataset, on which state-of-the-art models' performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. the proposed framework provides opportunities to test and improve llms. additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, Percy Liang
Abstract: foundation models are critical digital technologies with sweeping societal impact that necessitates transparency. to codify how foundation model developers should provide transparency about the development and deployment of their models, we propose foundation model transparency reports, drawing upon the transparency reporting practices in social media. while external documentation of societal harms prompted social media transparency reports, our objective is to institutionalize transparency reporting for foundation models while the industry is still nascent. to design our reports, we identify 6 design principles given the successes and shortcomings of social media transparency reporting. to further schematize our reports, we draw upon the 100 transparency indicators from the foundation model transparency index. given these indicators, we measure the extent to which they overlap with the transparency requirements included in six prominent government policies (e.g., the eu ai act, the us executive order on safe, secure, and trustworthy ai). well-designed transparency reports could reduce compliance costs, in part due to overlapping regulatory requirements across different jurisdictions. we encourage foundation model developers to regularly publish transparency reports, building upon recommendations from the g7 and the white house.
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
Abstract: the safety alignment of large language models (llms) is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger llms to output harmful content. however, current methods for jailbreaking llms, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned llms. this paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. we introduce an automatic prompt \textbf{d}ecomposition and \textbf{r}econstruction framework for jailbreak \textbf{attack} (drattack). drattack includes three key components: (a) `decomposition' of the original prompt into sub-prompts, (b) `reconstruction' of these sub-prompts implicitly by in-context learning with semantically similar but harmless reassembling demo, and (c) a `synonym search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking llms. an extensive empirical study across multiple open-source and closed-source llms demonstrates that, with a significantly reduced number of queries, drattack obtains a substantial gain of success rate over prior sota prompt-only attackers. notably, the success rate of 78.0\% on gpt-4 with merely 15 queries surpassed previous art by 33.1\%.

2024-02-24

Daoyuan Wu, Shuai Wang, Yang Liu, Ning Liu
Abstract: jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (llms). a considerable amount of research exists proposing more effective jailbreak attacks, including the recent greedy coordinate gradient (gcg) attack, jailbreak template-based attacks such as using "do-anything-now" (dan), and multilingual jailbreak. in contrast, the defensive side has been relatively less explored. this paper proposes a lightweight yet practical defense called selfdefend, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. our key insight is that regardless of the kind of jailbreak strategies employed, they eventually need to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to llms, and we found that existing llms can effectively recognize such harmful prompts that violate their safety policies. based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt exists in the user prompt and triggers a checkpoint in the normal stack once a token of "no" or a harmful prompt is output. the latter could also generate an explainable llm response to adversarial prompts. we demonstrate our idea of selfdefend works in various jailbreak scenarios through manual analysis in gpt-3.5/4. we also list three future directions to further enhance selfdefend.
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Abstract: large language models (llms) have emerged as a promising alternative to expensive human evaluations. however, the alignment and coverage of llm-based evaluations are often limited by the scope and potential bias of the evaluation prompts and criteria. to address this challenge, we propose hd-eval, a novel framework that iteratively aligns llm-based evaluators with human preference via hierarchical criteria decomposition. hd-eval inherits the essence from the evaluation mindset of human experts and enhances the alignment of llm-based evaluators by decomposing a given evaluation task into finer-grained criteria, aggregating them according to estimated human preferences, pruning insignificant criteria with attribution, and further decomposing significant criteria. by integrating these steps within an iterative alignment training process, we obtain a hierarchical decomposition of criteria that comprehensively captures aspects of natural language at multiple levels of granularity. implemented as a white box, the human preference-guided aggregator is efficient to train and more explainable than relying solely on prompting, and its independence from model parameters makes it applicable to closed-source llms. extensive experiments on three evaluation domains demonstrate the superiority of hd-eval in further aligning state-of-the-art evaluators and providing deeper insights into the explanation of evaluation results and the task itself.
Timothy R. Mcintosh, Teo Susnjak, Tong Liu, Paul Watters, Raza Nowrozy, Malka N. Halgamuge
Abstract: this study investigated the integration readiness of four predominant cybersecurity governance, risk and compliance (grc) frameworks - nist csf 2.0, cobit 2019, iso 27001:2022, and the latest iso 42001:2023 - for the opportunities, risks, and regulatory compliance when adopting large language models (llms), using qualitative content analysis and expert validation. our analysis, with both llms and human experts in the loop, uncovered potential for llm integration together with inadequacies in llm risk oversight of those frameworks. comparative gap analysis has highlighted that the new iso 42001:2023, specifically designed for artificial intelligence (ai) management systems, provided most comprehensive facilitation for llm opportunities, whereas cobit 2019 aligned most closely with the impending european union ai act. nonetheless, our findings suggested that all evaluated frameworks would benefit from enhancements to more effectively and more comprehensively address the multifaceted risks associated with llms, indicating a critical and time-sensitive need for their continuous evolution. we propose integrating human-expert-in-the-loop validation processes as crucial for enhancing cybersecurity frameworks to support secure and compliant llm integration, and discuss implications for the continuous evolution of cybersecurity grc frameworks to support the secure integration of llms.
Oliver Sourbut, Lewis Hammond, Harriet Wood
Abstract: many settings of interest involving humans and machines -- from virtual personal assistants to autonomous vehicles -- can naturally be modelled as principals (humans) delegating to agents (machines), which then interact with each other on their principals' behalf. we refer to these multi-principal, multi-agent scenarios as delegation games. in such games, there are two important failure modes: problems of control (where an agent fails to act in line their principal's preferences) and problems of cooperation (where the agents fail to work well together). in this paper we formalise and analyse these problems, further breaking them down into issues of alignment (do the players have similar preferences?) and capabilities (how competent are the players at satisfying those preferences?). we show -- theoretically and empirically -- how these measures determine the principals' welfare, how they can be estimated using limited observations, and thus how they might be used to help us design more aligned and cooperative ai systems.
Aleksa Sukovic, Goran Radanovic
Abstract: equipping agents with the capacity to justify made decisions using supporting evidence represents a cornerstone of accountable decision-making. furthermore, ensuring that justifications are in line with human expectations and societal norms is vital, especially in high-stakes situations such as healthcare. in this work, we propose the use of a debate-based reward model for reinforcement learning agents, where the outcome of a zero-sum debate game quantifies the justifiability of a decision in a particular state. this reward model is then used to train a justifiable policy, whose decisions can be more easily corroborated with supporting evidence. in the debate game, two argumentative agents take turns providing supporting evidence for two competing decisions. given the proposed evidence, a proxy of a human judge evaluates which decision is better justified. we demonstrate the potential of our approach in learning policies for prescribing and justifying treatment decisions of septic patients. we show that augmenting the reward with the feedback signal generated by the debate-based reward model yields policies highly favored by the judge when compared to the policy obtained solely from the environment rewards, while hardly sacrificing any performance. moreover, in terms of the overall performance and justifiability of trained policies, the debate-based feedback is comparable to the feedback obtained from an ideal judge proxy that evaluates decisions using the full information encoded in the state. this suggests that the debate game outputs key information contained in states that is most relevant for evaluating decisions, which in turn substantiates the practicality of combining our approach with human-in-the-loop evaluations. lastly, we showcase that agents trained via multi-agent debate learn to propose evidence that is resilient to refutations and closely aligns with human preferences.
Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash
Abstract: large language models (llms) are typically aligned to be harmless to humans. unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. more recent llms often incorporate an additional layer of defense, a guard model, which is a second llm that is designed to check and moderate the output response of the primary llm. our key contribution is to show a novel attack strategy, prp, that is successful against several open-source (e.g., llama 2) and closed-source (e.g., gpt 3.5) implementations of guard models. prp leverages a two step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the guard model, and (b) propagating this prefix to the response. we find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the guard model at all. our work suggests that further advances are required on defenses and guard models before they can be considered effective.
Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Ge Li
Abstract: recent statements about the impressive capabilities of large language models (llms) are usually supported by evaluating on open-access benchmarks. considering the vast size and wide-ranging sources of llms' training data, it could explicitly or implicitly include test data, leading to llms being more susceptible to data contamination. however, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for llms faces significant challenges. in this paper, we propose cdd, which stands for contamination detection via output distribution for llms. cdd necessitates only the sampled texts to detect data contamination, by identifying the peakedness of llm's output distribution. to mitigate the impact of data contamination in evaluation, we also present ted: trustworthy evaluation via output distribution, based on the correction of llm's output distribution. to facilitate this study, we introduce two benchmarks, i.e., detcon and comieval, for data contamination detection and contamination mitigation evaluation tasks. extensive experimental results show that cdd achieves the average relative improvements of 21.8\%-30.2\% over other contamination detection approaches in terms of accuracy, f1 score, and auc metrics, and can effectively detect contamination caused by the variants of test data. ted significantly mitigates performance improvements up to 66.9\% attributed to data contamination across 24 settings and 21 contamination degrees. in real-world applications, we reveal that chatgpt exhibits a high potential to suffer from data contamination on humaneval benchmark.
Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan
Abstract: prior works on detoxification are scattered in the sense that they do not cover all aspects of detoxification needed in a real-world scenario. notably, prior works restrict the task of developing detoxification models to only a seen subset of platforms, leaving the question of how the models would perform on unseen platforms unexplored. additionally, these works do not address non-detoxifiability, a phenomenon whereby the toxic text cannot be detoxified without altering the meaning. we propose greenllama, the first comprehensive end-to-end detoxification framework, which attempts to alleviate the aforementioned limitations. we first introduce a cross-platform pseudo-parallel corpus applying multi-step data processing and generation strategies leveraging chatgpt. we then train a suite of detoxification models with our cross-platform corpus. we show that our detoxification models outperform the sota model trained with human-annotated parallel corpus. we further introduce explanation to promote transparency and trustworthiness. greenllama additionally offers a unique paraphrase detector especially dedicated for the detoxification task to tackle the non-detoxifiable cases. through experimental analysis, we demonstrate the effectiveness of our cross-platform corpus and the robustness of greenllama against adversarial toxicity.

2024-02-23

Zejun Zhang, Li Zhang, Xin Yuan, Anlan Zhang, Mengwei Xu, Feng Qian
Abstract: with the advancement of large language models (llms), increasingly sophisticated and powerful gpts are entering the market. despite their popularity, the llm ecosystem still remains unexplored. additionally, llms' susceptibility to attacks raises concerns over safety and plagiarism. thus, in this work, we conduct a pioneering exploration of gpt stores, aiming to study vulnerabilities and plagiarism within gpt applications. to begin with, we conduct, to our knowledge, the first large-scale monitoring and analysis of two stores, an unofficial gptstore.ai, and an official openai gpt store. then, we propose a trilevel gpt reversing (t-gr) strategy for extracting gpt internals. to complete these two tasks efficiently, we develop two automated tools: one for web scraping and another designed for programmatically interacting with gpts. our findings reveal a significant enthusiasm among users and developers for gpt interaction and creation, as evidenced by the rapid increase in gpts and their creators. however, we also uncover a widespread failure to protect gpt internals, with nearly 90% of system prompts easily accessible, leading to considerable plagiarism and duplication among gpts.
Heegyu Kim, Sehyun Yuk, Hyunsouk Cho
Abstract: caution: this paper includes offensive words that could potentially cause unpleasantness. language models (lms) are vulnerable to exploitation for adversarial misuse. training lms for safety alignment is extensive and makes it hard to respond to fast-developing attacks immediately, such as jailbreaks. we propose self-refine with formatting that achieves outstanding safety even in non-safety-aligned lms and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. additionally, we proposed a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. we've also observed that non-safety-aligned lms outperform safety-aligned lms in safety tasks by giving more helpful and safe responses. in conclusion, our findings can achieve less safety risk with fewer computational costs, allowing non-safety lm to be easily utilized in real-world service.
Xin Yi, Linlin Wang, Xiaoling Wang, Liang He
Abstract: impressive results have been achieved in natural language processing (nlp) tasks through the training of large language models (llms). however, these models occasionally produce toxic content such as insults, threats, and profanity in response to certain prompts, thereby constraining their practical utility. to tackle this issue, various finetuning-based and decoding-based approaches have been utilized to mitigate toxicity. however, these methods typically necessitate additional costs such as high-quality training data or auxiliary models. in this paper, we propose fine-grained detoxification via instance-level prefixes (fgdilp) to mitigate toxic text without additional cost. specifically, fgdilp contrasts the contextualized representation in attention space using a positive prefix-prepended prompt against multiple negative prefix-prepended prompts at the instance level. this allows for constructing fine-grained subtoxicity vectors, which enables collaborative detoxification by fusing them to correct the normal generation process when provided with a raw prompt. we validate that fgdilp enables controlled text generation with regard to toxicity at both the utterance and context levels. our method surpasses prompt-based baselines in detoxification, although at a slight cost to generation fluency and diversity.
Yiping Jin, Leo Wanner, Alexander Shvets
Abstract: online hate detection suffers from biases incurred in data sampling, annotation, and model pre-training. therefore, measuring the averaged performance over all examples in held-out test data is inadequate. instead, we must identify specific model weaknesses and be informed when it is more likely to fail. a recent proposal in this direction is hatecheck, a suite for testing fine-grained model functionalities on synthesized data generated using templates of the kind "you are just a [slur] to me." however, despite enabling more detailed diagnostic insights, the hatecheck test cases are often generic and have simplistic sentence structures that do not match the real-world data. to address this limitation, we propose gpt-hatecheck, a framework to generate more diverse and realistic functional tests from scratch by instructing large language models (llms). we employ an additional natural language inference (nli) model to verify the generations. crowd-sourced annotation demonstrates that the generated test cases are of high quality. using the new functional tests, we can uncover model weaknesses that would be overlooked using the original hatecheck dataset.
Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee
Abstract: in this study, we tackle a growing concern around the safety and ethical use of large language models (llms). despite their potential, these models can be tricked into producing harmful or unethical content through various sophisticated methods, including 'jailbreaking' techniques and targeted manipulation. our work zeroes in on a specific issue: to what extent llms can be led astray by asking them to generate responses that are instruction-centric such as a pseudocode, a program or a software snippet as opposed to vanilla text. to investigate this question, we introduce techhazardqa, a dataset containing complex queries which should be answered in both text and instruction-centric formats (e.g., pseudocodes), aimed at identifying triggers for unethical responses. we query a series of llms -- llama-2-13b, llama-2-7b, mistral-v2 and mistral 8x7b -- and ask them to generate both text and instruction-centric responses. for evaluation we report the harmfulness score metric as well as judgements from gpt-4 and humans. overall, we observe that asking llms to produce instruction-centric responses enhances the unethical response generation by ~2-38% across the models. as an additional objective, we investigate the impact of model editing using the rome technique, which further increases the propensity for generating undesirable content. in particular, asking edited llms to generate instruction-centric responses further increases the unethical response generation by ~3-16% across the different models.
Zijie J. Wang, Chinmay Kulkarni, Lauren Wilcox, Michael Terry, Michael Madaio
Abstract: prompt-based interfaces for large language models (llms) have made prototyping and building ai-powered applications easier than ever before. however, identifying potential harms that may arise from ai applications remains a challenge, particularly during prompt-based prototyping. to address this, we present farsight, a novel in situ interactive tool that helps people identify potential harms from the ai applications they are prototyping. based on a user's prompt, farsight highlights news articles about relevant ai incidents and allows users to explore and edit llm-generated use cases, stakeholders, and harms. we report design insights from a co-design study with 10 ai prototypers and findings from a user study with 42 ai prototypers. after using farsight, ai prototypers in our user study are better able to independently identify potential harms associated with a prompt and find our tool more useful and usable than existing resources. their qualitative feedback also highlights that farsight encourages them to focus on end-users and think beyond immediate harms. we discuss these findings and reflect on their implications for designing ai prototyping experiences that meaningfully engage with ai harms. farsight is publicly accessible at: https://pair-code.github.io/farsight.
Yiran Liu, Ke Yang, Zehan Qi, Xiao Liu, Yang Yu, Chengxiang Zhai
Abstract: the growing integration of large language models (llms) into social operations amplifies their impact on decisions in crucial areas such as economics, law, education, and healthcare, raising public concerns about these models' discrimination-related safety and reliability. however, prior discrimination measuring frameworks solely assess the average discriminatory behavior of llms, often proving inadequate due to the overlook of an additional discrimination-leading factor, i.e., the llms' prediction variation across diverse contexts. in this work, we present the prejudice-caprice framework (pcf) that comprehensively measures discrimination in llms by considering both their consistently biased preference and preference variation across diverse contexts. specifically, we mathematically dissect the aggregated contextualized discrimination risk of llms into prejudice risk, originating from llms' persistent prejudice, and caprice risk, stemming from their generation inconsistency. in addition, we utilize a data-mining approach to gather preference-detecting probes from sentence skeletons, devoid of attribute indications, to approximate llms' applied contexts. while initially intended for assessing discrimination in llms, our proposed pcf facilitates the comprehensive and flexible measurement of any inductive biases, including knowledge alongside prejudice, across various modality models. we apply our discrimination-measuring framework to 12 common llms, yielding intriguing findings: i) modern llms demonstrate significant pro-male stereotypes, ii) llms' exhibited discrimination correlates with several social and economic factors, iii) prejudice risk dominates the overall discrimination risk and follows a normal distribution, and iv) caprice risk contributes minimally to the overall risk but follows a fat-tailed distribution, suggesting that it is wild risk requiring enhanced surveillance.
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi
Abstract: in this paper, we introduce a novel class of fast, beam search-based adversarial attack (beast) for language models (lms). beast employs interpretable parameters, enabling attackers to balance between attack speed, success rate, and the readability of adversarial prompts. the computational efficiency of beast facilitates us to investigate its applications on lms for jailbreaking, eliciting hallucinations, and privacy attacks. our gradient-free targeted attack can jailbreak aligned lms with high attack success rates within one minute. for instance, beast can jailbreak vicuna-7b-v1.5 under one minute with a success rate of 89% when compared to a gradient-based baseline that takes over an hour to achieve 70% success rate using a single nvidia rtx a6000 48gb gpu. additionally, we discover a unique outcome wherein our untargeted attack induces hallucinations in lm chatbots. through human evaluations, we find that our untargeted attack causes vicuna-7b-v1.5 to produce ~15% more incorrect outputs when compared to lm outputs in the absence of our attack. we also learn that 22% of the time, beast causes vicuna to generate outputs that are not relevant to the original prompt. further, we use beast to generate adversarial prompts in a few seconds that can boost the performance of existing membership inference attacks for lms. we believe that our fast attack, beast, has the potential to accelerate research in lm security and privacy. our codebase is publicly available at https://github.com/vinusankars/beast.
Ante Wang, Linfeng Song, Baolin Peng, Ye Tian, Lifeng Jin, Haitao Mi, Jinsong Su, Dong Yu
Abstract: this work studies improving large language model (llm) generations at inference time by mitigating fact-conflicting hallucinations. particularly, we propose a self-endorsement framework that leverages the fine-grained fact-level comparisons across multiple sampled responses. compared with prior ensemble methods (wang et al., 2022;chen et al., 2023)) that perform response-level selection, our approach can better alleviate hallucinations, especially for longform generation tasks. our approach can broadly benefit smaller and open-source llms as it mainly conducts simple content-based comparisons. experiments on biographies show that our method can effectively improve the factuality of generations with simple and intuitive prompts across different scales of llms. besides, comprehensive analyses on triviaqa and gsm8k demonstrate the potential of self-endorsement for broader application.
Zihan Zhou, Jonathan Booher, Wei Liu, Aleksandr Petiushko, Animesh Garg
Abstract: safe reinforcement learning tasks with multiple constraints are a challenging domain despite being very common in the real world. to address this challenge, we propose objective suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic. we benchmark objective suppression in two multi-constraint safety domains, including an autonomous driving domain where any incorrect behavior can lead to disastrous consequences. empirically, we demonstrate that our proposed method, when combined with existing safe rl algorithms, can match the task reward achieved by our baselines with significantly fewer constraint violations.
Zhenhua Wang, Wei Xie, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen
Abstract: large language models (llms) have gradually become the gateway for people to acquire new knowledge. however, attackers can break the model's security protection ("jail") to access restricted information, which is called "jailbreaking." previous studies have shown the weakness of current llms when confronted with such jailbreaking attacks. nevertheless, comprehension of the intrinsic decision-making mechanism within the llms upon receipt of jailbreak prompts is noticeably lacking. our research provides a psychological explanation of the jailbreak prompts. drawing on cognitive consistency theory, we argue that the key to jailbreak is guiding the llm to achieve cognitive coordination in an erroneous direction. further, we propose an automatic black-box jailbreaking method based on the foot-in-the-door (fitd) technique. this method progressively induces the model to answer harmful questions via multi-step incremental prompts. we instantiated a prototype system to evaluate the jailbreaking effectiveness on 8 advanced llms, yielding an average success rate of 83.9%. this study builds a psychological perspective on the explanatory insights into the intrinsic decision-making logic of llms.
Shenglai Zeng, Jiankun Zhang, Pengfei He, Yue Xing, Yiding Liu, Han Xu, Jie Ren, Shuaiqiang Wang, Dawei Yin, Yi Chang, Jiliang Tang
Abstract: retrieval-augmented generation (rag) is a powerful technique to facilitate language model with proprietary and private data, where data privacy is a pivotal concern. whereas extensive research has demonstrated the privacy risks of large language models (llms), the rag technique could potentially reshape the inherent behaviors of llm generation, posing new privacy issues that are currently under-explored. in this work, we conduct extensive empirical studies with novel attack methods, which demonstrate the vulnerability of rag systems on leaking the private retrieval database. despite the new risk brought by rag on the retrieval data, we further reveal that rag can mitigate the leakage of the llms' training data. overall, we provide new insights in this paper for privacy protection of retrieval-augmented llms, which benefit both llms and rag systems builders. our code is available at https://github.com/phycholosogy/rag-privacy.

2024-02-22

Ang Li, Jingqian Zhao, Bin Liang, Lin Gui, Hui Wang, Xi Zeng, Kam-Fai Wong, Ruifeng Xu
Abstract: large language models (llms) have achieved remarkable progress in many natural language processing tasks. however, our experiment reveals that, in stance detection tasks, llms may generate biased stances due to spurious sentiment-stance correlation and preference towards certain individuals and topics, thus harming their performance. therefore, in this paper, we propose to mitigate biases of llms in stance detection with calibration (mb-cal). in which, a novel gated calibration network is devised to mitigate the biases on the stance reasoning results from llms. further, to make the calibration more accurate and generalizable, we construct counterfactual augmented data to rectify stance biases. experimental results on in-target and zero-shot stance detection tasks show that the proposed mb-cal can effectively mitigate biases of llms, achieving state-of-the-art results.
Chen Jia
Abstract: preference learning (pl) with large language models (llms) aims to align the llms' generations with human preferences. previous work on reinforcement learning from human feedback (rlhf) has demonstrated promising results in in-distribution pl. however, due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging. thus, out-of-distribution (ood) pl is practically useful for enhancing the generalization ability of llms with limited preference feedback. this work addresses ood pl by optimizing a general reward model through a meta-learning approach. during meta-training, a bilevel optimization algorithm is utilized to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. when encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for pl. we theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.
Yuzhe Yang, Yujia Liu, Xin Liu, Avanti Gulhane, Domenico Mastrodicasa, Wei Wu, Edward J Wang, Dushyant W Sahani, Shwetak Patel
Abstract: advances in artificial intelligence (ai) have achieved expert-level performance in medical imaging applications. notably, self-supervised vision-language foundation models can detect a broad spectrum of pathologies without relying on explicit training annotations. however, it is crucial to ensure that these ai models do not mirror or amplify human biases, thereby disadvantaging historically marginalized groups such as females or black patients. the manifestation of such biases could systematically delay essential medical care for certain patient subgroups. in this study, we investigate the algorithmic fairness of state-of-the-art vision-language foundation models in chest x-ray diagnosis across five globally-sourced datasets. our findings reveal that compared to board-certified radiologists, these foundation models consistently underdiagnose marginalized groups, with even higher rates seen in intersectional subgroups, such as black female patients. such demographic biases present over a wide range of pathologies and demographic attributes. further analysis of the model embedding uncovers its significant encoding of demographic information. deploying ai systems with these biases in medical imaging can intensify pre-existing care disparities, posing potential challenges to equitable healthcare access and raising ethical questions about their clinical application.
Priyanshul Govil, Vamshi Krishna Bonagiri, Manas Gaur, Ponnurangam Kumaraguru, Sanorita Dey
Abstract: large language models (llms) are trained on inherently biased data. previous works on debiasing models rely on benchmark datasets to measure model performance. however, these datasets suffer from several pitfalls due to the extremely subjective understanding of bias, highlighting a critical need for contextual exploration. we propose understanding the context of user inputs with consideration of the diverse situations in which input statements are possible. this approach would allow for frameworks that foster bias awareness rather than guardrails that hurt user engagement. our contribution is twofold: (i) we create a dataset of 2287 stereotyped statements augmented with points for adding context; (ii) we develop the context-oriented bias indicator and assessment score (cobias) to assess statements' contextual reliability in measuring bias. our metric is a significant predictor of the contextual reliability of bias-benchmark datasets ($\chi^2=71.02, p<2.2 \cdot 10^{-16})$. cobias can be used to create reliable datasets, resulting in an improvement in bias mitigation works.
Oliver Bentham, Nathan Stringham, Ana Marasović
Abstract: understanding the extent to which chain-of-thought (cot) generations align with a large language model's (llm) internal computations is critical for deciding whether to trust an llm's output. as a proxy for cot faithfulness, arxiv:2307.13702 propose a metric that measures a model's dependence on its cot for producing an answer. within a single family of proprietary models, they find that llms exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. we evaluate whether these results generalize as a property of all llms. we replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for cot faithfulness they report. however, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. the faithfulness metric is also highly correlated ($r^2$ = 0.91) with accuracy, raising doubts about its validity as a construct for evaluating faithfulness.
Zefeng Wang, Zhen Han, Shuo Chen, Fan Xue, Zifeng Ding, Xun Xiao, Volker Tresp, Philip Torr, Jindong Gu
Abstract: recently, multimodal llms (mllms) have shown a great ability to understand images. however, like traditional vision models, they are still vulnerable to adversarial images. meanwhile, chain-of-thought (cot) reasoning has been widely explored on mllms, which not only improves model's performance, but also enhances model's explainability by giving intermediate reasoning steps. nevertheless, there is still a lack of study regarding mllms' adversarial robustness with cot and an understanding of what the rationale looks like when mllms infer wrong answers with adversarial images. our research evaluates the adversarial robustness of mllms when employing cot reasoning, finding that cot marginally improves adversarial robustness against existing attack methods. moreover, we introduce a novel stop-reasoning attack technique that effectively bypasses the cot-induced robustness enhancements. finally, we demonstrate the alterations in cot reasoning when mllms confront adversarial images, shedding light on their reasoning process under adversarial attacks.
Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick Mcdaniel, Muhao Chen, Bo Li, Chaowei Xiao
Abstract: despite the general capabilities of large language models (llms) like gpt-4 and llama-2, these models still request fine-tuning or adaptation with customized data when it comes to meeting the specific business demands and intricacies of tailored use cases. however, this process inevitably introduces new safety threats, particularly against the fine-tuning based jailbreak attack (fjattack), where incorporating just a few harmful examples into the fine-tuning dataset can significantly compromise the model safety. though potential defenses have been proposed by incorporating safety examples into the fine-tuning dataset to reduce the safety issues, such approaches require incorporating a substantial amount of safety examples, making it inefficient. to effectively defend against the fjattack with limited safety examples, we propose a backdoor enhanced safety alignment method inspired by an analogy with the concept of backdoor attacks. in particular, we construct prefixed safety examples by integrating a secret prompt, acting as a "backdoor trigger", that is prefixed to safety examples. our comprehensive experiments demonstrate that through the backdoor enhanced safety alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned llms will achieve similar safety performance as the original aligned models. furthermore, we also explore the effectiveness of our method in a more practical setting where the fine-tuning data consists of both fjattack examples and the fine-tuning task data. our method shows great efficacy in defending against fjattack without harming the performance of fine-tuning tasks.
Victoria Lin, Eli Ben-Michael, Louis-Philippe Morency
Abstract: as large language models (llms) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. in this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. we first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. we formalize this causal language optimization problem, and we develop a method--causal preference optimization (cpo)--that solves an unbiased surrogate objective for the problem. we further extend cpo with doubly robust cpo (dr-cpo), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. finally, we empirically demonstrate the effectiveness of (dr-)cpo in optimizing state-of-the-art llms for human preferences on direct outcome data, and we validate the robustness of dr-cpo under difficult confounding conditions.
Michael J. Ryan, William Held, Diyi Yang
Abstract: before being deployed for user-facing applications, developers align large language models (llms) to user preferences through a variety of procedures, such as reinforcement learning from human feedback (rlhf) and direct preference optimization (dpo). current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. however, human preferences are not universal, and aligning to specific preference sets may have unintended effects. we explore how alignment impacts performance along three axes of global representation: english dialects, multilingualism, and opinions from and about countries worldwide. our results show that current alignment procedures create disparities between english dialects and global opinions. we find alignment improves capabilities in several languages. we conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning.
Yang Deng, Yong Zhao, Moxin Li, See-Kiong Ng, Tat-Seng Chua
Abstract: despite the remarkable abilities of large language models (llms) to answer questions, they often display a considerable level of overconfidence even when the question does not have a definitive answer. to avoid providing hallucinated answers to these unknown questions, existing studies typically investigate approaches to refusing to answer these questions. in this work, we propose a novel and scalable self-alignment method to utilize the llm itself to enhance its response-ability to different types of unknown questions, being capable of not only refusing to answer but also providing explanation to the unanswerability of unknown questions. specifically, the self-align method first employ a two-stage class-aware self-augmentation approach to generate a large amount of unknown question-response data. then we conduct disparity-driven self-curation to select qualified data for fine-tuning the llm itself for aligning the responses to unknown questions as desired. experimental results on two datasets across four types of unknown questions validate the superiority of the self-align method over existing baselines in terms of three types of task formulation.
Yuwei Wu, Shijing Si, Yugui Zhang, Jiawen Gu, Jedrek Wosik
Abstract: email continues to be a pivotal and extensively utilized communication medium within professional and commercial domains. nonetheless, the prevalence of spam emails poses a significant challenge for users, disrupting their daily routines and diminishing productivity. consequently, accurately identifying and filtering spam based on content has become crucial for cybersecurity. recent advancements in natural language processing, particularly with large language models like chatgpt, have shown remarkable performance in tasks such as question answering and text generation. however, its potential in spam identification remains underexplored. to fill in the gap, this study attempts to evaluate chatgpt's capabilities for spam identification in both english and chinese email datasets. we employ chatgpt for spam email detection using in-context learning, which requires a prompt instruction and a few demonstrations. we also investigate how the training example size affects the performance of chatgpt. for comparison, we also implement five popular benchmark methods, including naive bayes, support vector machines (svm), logistic regression (lr), feedforward dense neural networks (dnn), and bert classifiers. though extensive experiments, the performance of chatgpt is significantly worse than deep supervised learning methods in the large english dataset, while it presents superior performance on the low-resourced chinese dataset, even outperforming bert in this case.

2024-02-21

Lingxi Zhang, Yue Yu, Kuan Wang, Chao Zhang
Abstract: retrieval-augmented generation enhances large language models (llms) by incorporating relevant information from external knowledge sources. this enables llms to adapt to specific domains and mitigate hallucinations in knowledge-intensive tasks. however, existing retrievers are often misaligned with llms due to their separate training processes and the black-box nature of llms. to address this challenge, we propose arl2, a retriever learning technique that harnesses llms as labelers. arl2 leverages llms to annotate and score relevant evidence, enabling learning the retriever from robust llm supervision. furthermore, arl2 uses an adaptive self-training strategy for curating high-quality and diverse relevance data, which can effectively reduce the annotation cost. extensive experiments demonstrate the effectiveness of arl2, achieving accuracy improvements of 5.4% on nq and 4.6% on mmlu compared to the state-of-the-art methods. additionally, arl2 exhibits robust transfer learning capabilities and strong zero-shot generalization abilities. our code will be published at \url{https://github.com/zhanglingxi-cs/arl2}.
Jiyoung Lee, Minwoo Kim, Seungho Kim, Junghwan Kim, Seunghyun Won, Hwaran Lee, Edward Choi
Abstract: for large language models (llms) to be effectively deployed in a specific country, they must possess an understanding of the nation's culture and basic knowledge. to this end, we introduce national alignment, which measures an alignment between an llm and a targeted country from two aspects: social value alignment and common knowledge alignment. social value alignment evaluates how well the model understands nation-specific social values, while common knowledge alignment examines how well the model captures basic knowledge related to the nation. we constructed kornat, the first benchmark that measures national alignment with south korea. for the social value dataset, we obtained ground truth labels from a large-scale survey involving 6,174 unique korean participants. for the common knowledge dataset, we constructed samples based on korean textbooks and ged reference materials. kornat contains 4k and 6k multiple-choice questions for social value and common knowledge, respectively. our dataset creation process is meticulously designed and based on statistical sampling theory and was refined through multiple rounds of human review. the experiment results of seven llms reveal that only a few models met our reference score, indicating a potential for further enhancement. kornat has received government approval after passing an assessment conducted by a government-affiliated organization dedicated to evaluating dataset quality. samples and detailed evaluation protocols of our dataset can be found in https://selectstar.ai/ko/papers-national-alignment
Da Yu, Peter Kairouz, Sewoong Oh, Zheng Xu
Abstract: service providers of large language model (llm) applications collect user instructions in the wild and use them in further aligning llms with users' intentions. these instructions, which potentially contain sensitive information, are annotated by human workers in the process. this poses a new privacy risk not addressed by the typical private optimization. to this end, we propose using synthetic instructions to replace real instructions in data annotation and model fine-tuning. formal differential privacy is guaranteed by generating those synthetic instructions using privately fine-tuned generators. crucial in achieving the desired utility is our novel filtering algorithm that matches the distribution of the synthetic instructions to that of the real ones. in both supervised fine-tuning and reinforcement learning from human feedback, our extensive experiments demonstrate the high utility of the final set of synthetic instructions by showing comparable results to real instructions. in supervised fine-tuning, models trained with private synthetic instructions outperform leading open-source models such as vicuna.
Michal Spiegel, Dominik Macko
Abstract: semeval-2024 task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection. such a detection is important for preventing a potential misuse of large language models (llms), the newest of which are very capable in generating multilingual human-like texts. we have coped with this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller llms for text classification. we have further used the per-language classification-threshold calibration to uniquely combine fine-tuned models predictions with statistical detection metrics to improve generalization of the system detection performance. our submitted method achieved competitive results, ranking at the fourth place, just under 1 percentage point behind the winner.
Vamshi Krishna Bonagiri, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, Manas Gaur
Abstract: despite recent advancements showcasing the impressive capabilities of large language models (llms) in conversational systems, we show that even state-of-the-art llms are morally inconsistent in their generations, questioning their reliability (and trustworthiness in general). prior works in llm evaluation focus on developing ground-truth data to measure accuracy on specific tasks. however, for moral scenarios that often lack universally agreed-upon answers, consistency in model responses becomes crucial for their reliability. to address this issue, we propose an information-theoretic measure called semantic graph entropy (sage), grounded in the concept of "rules of thumb" (rots) to measure a model's moral consistency. rots are abstract principles learned by a model and can help explain their decision-making strategies effectively. to this extent, we construct the moral consistency corpus (mcc), containing 50k moral questions, responses to them by llms, and the rots that these models followed. furthermore, to illustrate the generalizability of sage, we use it to investigate llm consistency on two popular datasets -- truthfulqa and hellaswag. our results reveal that task-accuracy and consistency are independent problems, and there is a dire need to investigate these issues further.
Hezhao Zhang, Lasana Harris, Nafise Sadat Moosavi
Abstract: dehumanization, characterized as a subtle yet harmful manifestation of hate speech, involves denying individuals of their human qualities and often results in violence against marginalized groups. despite significant progress in natural language processing across various domains, its application in detecting dehumanizing language is limited, largely due to the scarcity of publicly available annotated data for this domain. this paper evaluates the performance of cutting-edge nlp models, including gpt-4, gpt-3.5, and llama-2, in identifying dehumanizing language. our findings reveal that while these models demonstrate potential, achieving a 70\% accuracy rate in distinguishing dehumanizing language from broader hate speech, they also display biases. they are over-sensitive in classifying other forms of hate speech as dehumanization for a specific subset of target groups, while more frequently failing to identify clear cases of dehumanization for other target groups. moreover, leveraging one of the best-performing models, we automatically annotated a larger dataset for training more accessible models. however, our findings indicate that these models currently do not meet the high-quality data generation threshold necessary for this task.
Robin Staab, Mark Vero, Mislav Balunović, Martin Vechev
Abstract: recent work in privacy research on large language models has shown that they achieve near human-level performance at inferring personal data from real-world online texts. with consistently increasing model capabilities, existing text anonymization methods are currently lacking behind regulatory requirements and adversarial threats. this raises the question of how individuals can effectively protect their personal data in sharing online texts. in this work, we take two steps to answer this question: we first present a new setting for evaluating anonymizations in the face of adversarial llms inferences, allowing for a natural measurement of anonymization performance while remedying some of the shortcomings of previous metrics. we then present our llm-based adversarial anonymization framework leveraging the strong inferential capabilities of llms to inform our anonymization procedure. in our experimental evaluation, we show on real-world and synthetic online texts how adversarial anonymization outperforms current industry-grade anonymizers both in terms of the resulting utility and privacy.
Mohammad Amaz Uddin, Iqbal H. Sarker
Abstract: phishing email is a serious cyber threat that tries to deceive users by sending false emails with the intention of stealing confidential information or causing financial harm. attackers, often posing as trustworthy entities, exploit technological advancements and sophistication to make detection and prevention of phishing more challenging. despite extensive academic research, phishing detection remains an ongoing and formidable challenge in the cybersecurity landscape. large language models (llms) and masked language models (mlms) possess immense potential to offer innovative solutions to address long-standing challenges. in this research paper, we present an optimized, fine-tuned transformer-based distilbert model designed for the detection of phishing emails. in the detection process, we work with a phishing email dataset and utilize the preprocessing techniques to clean and solve the imbalance class issues. through our experiments, we found that our model effectively achieves high accuracy, demonstrating its capability to perform well. finally, we demonstrate our fine-tuned model using explainable-ai (xai) techniques such as local interpretable model-agnostic explanations (lime) and transformer interpret to explain how our model makes predictions in the context of text classification for phishing emails.
Prakamya Mishra, Zonghai Yao, Parth Vashisht, Feiyun Ouyang, Beining Wang, Vidhi Dhaval Mody, Hong Yu
Abstract: large language models (llms) such as gpt and llama have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies, a critical issue in clinical nlp applications where errors could lead to serious consequences. to counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline that utilizes gpt-3.5 and gpt-4 to generate high-quality feedback aimed at enhancing factual consistency in clinical note summarization. our research primarily focuses on edit feedback, mirroring the practical scenario in which medical professionals refine ai system outputs without the need for additional annotations. despite gpt's proven expertise in various clinical nlp tasks, such as the medical licensing examination, there is scant research on its capacity to deliver expert-level edit feedback for improving weaker lms or llms generation quality. this work leverages gpt's advanced capabilities in clinical nlp to offer expert-level edit feedback. through the use of two distinct alignment algorithms (dpo and salt) based on gpt edit feedback, our goal is to reduce hallucinations and align closely with medical facts, endeavoring to narrow the divide between ai-generated content and factual accuracy. this highlights the substantial potential of gpt edits in enhancing the alignment of clinical factuality.
Federico Bianchi, James Zou
Abstract: the risks derived from large language models (llms) generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. in our study, we shift the focus to how even safe text coming from llms can be easily turned into potentially dangerous content through bait-and-switch attacks. in such attacks, the user first prompts llms with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. the alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for llms. in particular, we stress that focusing on the safety of the verbatim llm outputs is insufficient and that we also need to consider post-hoc transformations.
Rahul Zalkikar, Kanchan Chandra
Abstract: social and political scientists often aim to discover and measure distinct biases from text data representations (embeddings). innovative transformer-based language models produce contextually-aware token embeddings and have achieved state-of-the-art performance for a variety of natural language tasks, but have been shown to encode unwanted biases for downstream applications. in this paper, we evaluate the social biases encoded by transformers trained with the masked language modeling objective using proposed proxy functions within an iterative masking experiment to measure the quality of transformer models' predictions, and assess the preference of mlms towards disadvantaged and advantaged groups. we compare bias estimations with those produced by other evaluation methods using two benchmark datasets, finding relatively high religious and disability biases across considered mlms and low gender bias in one dataset relative to the other. our measures outperform others in their agreement with human annotators. we extend on previous work by evaluating social biases introduced after re-training an mlm under the masked language modeling objective (w.r.t. the model's pre-trained base), and find that proposed measures produce more accurate estimations of relative preference for biased sentences between transformers than others based on our methods.
Vyas Raina, Adian Liusie, Mark Gales
Abstract: large language models (llms) are powerful zero-shot assessors and are increasingly used in real-world situations such as for written exams or benchmarking systems. despite this, no existing work has analyzed the vulnerability of judge-llms against adversaries attempting to manipulate outputs. this work presents the first study on the adversarial robustness of assessment llms, where we search for short universal phrases that when appended to texts can deceive llms to provide high assessment scores. experiments on summeval and topicalchat demonstrate that both llm-scoring and pairwise llm-comparative assessment are vulnerable to simple concatenation attacks, where in particular llm-scoring is very susceptible and can yield maximum assessment scores irrespective of the input text quality. interestingly, such attacks are transferable and phrases learned on smaller open-source llms can be applied to larger closed-source models, such as gpt3.5. this highlights the pervasive nature of the adversarial vulnerabilities across different judge-llm sizes, families and methods. our findings raise significant concerns on the reliability of llms-as-a-judge methods, and underscore the importance of addressing vulnerabilities in llm assessment methods before deployment in high-stakes real-world scenarios.
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
Abstract: it has recently been shown that adversarial attacks on large language models (llms) can "jailbreak" the model into making harmful statements. in this work, we argue that the spectrum of adversarial attacks on llms is much larger than merely jailbreaking. we provide a broad overview of possible attack surfaces and attack goals. based on a series of concrete examples, we discuss, categorize and systematize attacks that coerce varied unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. we analyze these attacks in controlled experiments, and find that many of them stem from the practice of pre-training llms with coding capabilities, as well as the continued existence of strange "glitch" tokens in common llm vocabularies that should be removed for security reasons.
Han Zhang, Lin Gui, Yu Lei, Yuanzhao Zhai, Yehong Zhang, Yulan He, Hui Wang, Yue Yu, Kam-Fai Wong, Bin Liang, Ruifeng Xu
Abstract: reinforcement learning from human feedback (rlhf) is commonly utilized to improve the alignment of large language models (llms) with human preferences. given the evolving nature of human preferences, continual alignment becomes more crucial and practical in comparison to traditional static alignment. nevertheless, making rlhf compatible with continual learning (cl) is challenging due to its complex process. meanwhile, directly learning new human preferences may lead to catastrophic forgetting (cf) of historical preferences, resulting in helpless or harmful outputs. to overcome these challenges, we propose the continual optimal policy regularization (copr) method, which draws inspiration from the optimal policy theory. copr utilizes a sampling distribution as a demonstration and regularization constraints for cl. it adopts the lagrangian duality (ld) method to dynamically regularize the current policy based on the historically optimal policy, which prevents cf and avoids over-emphasizing unbalanced objectives. we also provide formal proof for the learnability of copr. the experimental results show that copr outperforms strong cl baselines on our proposed benchmark, in terms of reward-based, gpt-4 evaluations and human assessment. furthermore, we validate the robustness of copr under various cl settings, including different backbones, replay memory sizes, and learning orders.
Masahiro Kaneko, Danushka Bollegala, Timothy Baldwin
Abstract: recent studies have demonstrated that large language models (llms) have ethical-related problems such as social biases, lack of moral reasoning, and generation of offensive content. the existing evaluation metrics and methods to address these ethical challenges use datasets intentionally created by instructing humans to create instances including ethical problems. therefore, the data does not reflect prompts that users actually provide when utilizing llm services in everyday contexts. this may not lead to the development of safe llms that can address ethical challenges arising in real-world applications. in this paper, we create eagle datasets extracted from real interactions between chatgpt and users that exhibit social biases, toxicity, and immoral problems. our experiments show that eagle captures complementary aspects, not covered by existing datasets proposed for evaluation and mitigation of such ethical challenges. our code is publicly available at https://huggingface.co/datasets/masahirokaneko/eagle.
Yupeng Cao, Aishwarya Muralidharan Nair, Elyon Eyimife, Nastaran Jamalipour Soofi, K. P. Subbalakshmi, John R. Wullert, Chumki Basu, David Shallcross
Abstract: scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evidenced during the covid-19 pandemic. automatic detection of misinformation in the scientific domain is challenging because of the distinct styles of writing in these two media types and is still in its nascence. most research on the validity of scientific reporting treats this problem as a claim verification challenge. in doing so, significant expert human effort is required to generate appropriate claims. our solution bypasses this step and addresses a more real-world scenario where such explicit, labeled claims may not be available. the central research question of this paper is whether it is possible to use large language models (llms) to detect misinformation in scientific reporting. to this end, we first present a new labeled dataset scinews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the cord-19 database. our dataset includes both human-written and llm-generated news articles, making it more comprehensive in terms of capturing the growing trend of using llms to generate popular press articles. then, we identify dimensions of scientific validity in science news articles and explore how this can be integrated into the automated detection of scientific misinformation. we propose several baseline architectures using llms to automatically detect false representations of scientific findings in the popular press. for each of these architectures, we use several prompt engineering strategies including zero-shot, few-shot, and chain-of-thought prompting. we also test these architectures and prompting strategies on gpt-3.5, gpt-4, and llama2-7b, llama2-13b.
Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang
Abstract: large language models (llms), used in creative writing, code generation, and translation, generate text based on input sequences but are vulnerable to jailbreak attacks, where crafted prompts induce harmful outputs. most jailbreak prompt methods use a combination of jailbreak templates followed by questions to ask to create jailbreak prompts. however, existing jailbreak prompt designs generally suffer from excessive semantic differences, resulting in an inability to resist defenses that use simple semantic metrics as thresholds. jailbreak prompts are semantically more varied than the original questions used for queries. in this paper, we introduce a semantic mirror jailbreak (smj) approach that bypasses llms by generating jailbreak prompts that are semantically similar to the original question. we model the search for jailbreak prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standardized set of genetic algorithms for generating eligible prompts. compared to the baseline autodan-ga, smj achieves attack success rates (asr) that are at most 35.4% higher without onion defense and 85.2% higher with onion defense. smj's better performance in all three semantic meaningfulness metrics of jailbreak prompt, similarity, and outlier, also means that smj is resistant to defenses that use those metrics as thresholds.
Bradley Emi, Max Spero
Abstract: we present the checkforai text classifier, a transformer-based neural network trained to distinguish text written by large language models from text written by humans. checkforai outperforms zero-shot methods such as detectgpt as well as leading commercial ai detection tools with over 9 times lower error rates on a comprehensive benchmark comprised of ten text domains (student writing, creative writing, scientific writing, books, encyclopedias, news, email, scientific papers, short-form q&a) and 8 open- and closed-source large language models. we propose a training algorithm, hard negative mining with synthetic mirrors, that enables our classifier to achieve orders of magnitude lower false positive rates on high-data domains such as reviews. finally, we show that checkforai is not biased against nonnative english speakers and generalizes to domains and models unseen during training.
Amit Haim, Alejandro Salinas, Julian Nyarko
Abstract: we employ an audit design to investigate biases in state-of-the-art large language models, including gpt-4. in our study, we elicit prompt the models for advice regarding an individual across a variety of scenarios, such as during car purchase negotiations or election outcome predictions. we find that the advice systematically disadvantages names that are commonly associated with racial minorities and women. names associated with black women receive the least advantageous outcomes. the biases are consistent across 42 prompt templates and several models, indicating a systemic issue rather than isolated incidents. while providing numerical, decision-relevant anchors in the prompt can successfully counteract the biases, qualitative details have inconsistent effects and may even increase disparities. our findings underscore the importance of conducting audits at the point of llm deployment and implementation to mitigate their potential for harm against marginalized communities.
Shen Li, Liuyi Yao, Jinyang Gao, Lan Zhang, Yaliang Li
Abstract: to support various applications, business owners often seek the customized models that are obtained by fine-tuning a pre-trained llm through the api provided by llm owners or cloud servers. however, this process carries a substantial risk of model misuse, potentially resulting in severe economic consequences for business owners. thus, safeguarding the copyright of these customized models during llm fine-tuning has become an urgent practical requirement, but there are limited existing solutions to provide such protection. to tackle this pressing issue, we propose a novel watermarking approach named "double-i watermark". specifically, based on the instruct-tuning data, two types of backdoor data paradigms are introduced with trigger in the instruction and the input, respectively. by leveraging llm's learning capability to incorporate customized backdoor samples into the dataset, the proposed approach effectively injects specific watermarking information into the customized model during fine-tuning, which makes it easy to inject and verify watermarks in commercial scenarios. we evaluate the proposed "double-i watermark" under various fine-tuning methods, demonstrating its harmlessness, robustness, uniqueness, imperceptibility, and validity through both theoretical analysis and experimental verification.

2024-02-20

Zeyang Sha, Yang Zhang
Abstract: the increasing reliance on large language models (llms) such as chatgpt in various fields emphasizes the importance of ``prompt engineering,'' a technology to improve the quality of model outputs. with companies investing significantly in expert prompt engineers and educational resources rising to meet market demand, designing high-quality prompts has become an intriguing challenge. in this paper, we propose a novel attack against llms, named prompt stealing attacks. our proposed prompt stealing attack aims to steal these well-designed prompts based on the generated answers. the prompt stealing attack contains two primary modules: the parameter extractor and the prompt reconstruction. the goal of the parameter extractor is to figure out the properties of the original prompts. we first observe that most prompts fall into one of three categories: direct prompt, role-based prompt, and in-context prompt. our parameter extractor first tries to distinguish the type of prompts based on the generated answers. then, it can further predict which role or how many contexts are used based on the types of prompts. following the parameter extractor, the prompt reconstructor can be used to reconstruct the original prompts based on the generated answers and the extracted features. the final goal of the prompt reconstructor is to generate the reversed prompts, which are similar to the original prompts. our experimental results show the remarkable performance of our proposed attacks. our proposed attacks add a new dimension to the study of prompt engineering and call for more attention to the security issues on llms.
Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh
Abstract: large language model (llm) services and models often come with legal rules on who can use them and how they must use them. assessing the compliance of the released llms is crucial, as these rules protect the interests of the llm contributor and prevent misuse. in this context, we describe the novel problem of black-box identity verification (bbiv). the goal is to determine whether a third-party application uses a certain llm through its chat function. we propose a method called targeted random adversarial prompt (trap) that identifies the specific llm in use. we repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target llm, while other models give random answers. trap detects the target llms with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. trap remains effective even if the llm has minor changes that do not significantly alter the original function.
Yujun Zhou, Yufei Han, Haomin Zhuang, Taicheng Guo, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang
Abstract: large language models (llms) demonstrate remarkable capabilities across diverse applications. however, concerns regarding their security, particularly the vulnerability to jailbreak attacks, persist. drawing inspiration from adversarial training in deep learning and llm agent learning processes, we introduce the in-context adversarial game (icag) for defending against jailbreaks without the need for fine-tuning. icag leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. unlike traditional methods that rely on static datasets, icag employs an iterative process to enhance both the defense and attack agents. this continuous improvement process strengthens defenses against newly generated jailbreak prompts. our empirical studies affirm icag's efficacy, where llms safeguarded by icag exhibit significantly reduced jailbreak success rates across various attack scenarios. moreover, icag demonstrates remarkable transferability to other llms, indicating its potential as a versatile defense mechanism.
Adam X. Yang, Maxime Robeyns, Thomas Coste, Jun Wang, Haitham Bou-Ammar, Laurence Aitchison
Abstract: to ensure that large language model (llm) responses are helpful and non-toxic, we usually fine-tune a reward model on human preference data. we then select policy responses with high rewards (best-of-n sampling) or further optimize the policy to produce responses with high rewards (reinforcement learning from human feedback). however, this process is vulnerable to reward overoptimization or hacking, in which the responses selected have high rewards due to errors in the reward model rather than a genuine preference. this is especially problematic as the prompt or response diverges from the training data. it should be possible to mitigate these issues by training a bayesian reward model, which signals higher uncertainty further from the training data distribution. therefore, we trained bayesian reward models using laplace-lora (yang et al., 2024) and found that the resulting uncertainty estimates can successfully mitigate reward overoptimization in best-of-n sampling.
Yusu Qian, Haotian Zhang, Yinfei Yang, Zhe Gan
Abstract: the remarkable advancements in multimodal large language models (mllms) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. to quantitatively assess this vulnerability, we present mad-bench, a carefully curated benchmark that contains 850 test samples divided into 6 categories, such as non-existent objects, count of objects, spatial relationship, and visual confusion. we provide a comprehensive analysis of popular mllms, ranging from gpt-4v, gemini-pro, to open-sourced models, such as llava-1.5 and cogvlm. empirically, we observe significant performance gaps between gpt-4v and other models; and previous robust instruction-tuned models, such as lrv-instruction and llava-rlhf, are not effective on this new benchmark. while gpt-4v achieves 75.02% accuracy on mad-bench, the accuracy of any other model in our experiments ranges from 5% to 35%. we further propose a remedy that adds an additional paragraph to the deceptive prompts to encourage models to think twice before answering the question. surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. we hope mad-bench can serve as a valuable benchmark to stimulate further research to enhance models' resilience against deceptive prompts.
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
Abstract: direct preference optimisation (dpo) is effective at significantly improving the performance of large language models (llms) on downstream tasks such as reasoning, summarisation, and alignment. using pairs of preferred and dispreferred data, dpo models the \textit{relative} probability of picking one response over another. in this work, first we show theoretically that the standard dpo loss can lead to a \textit{reduction} of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. we then show empirically that this phenomenon occurs when fine-tuning llms on common datasets, especially datasets in which the edit distance between pairs of completions is low. using these insights, we design dpo-positive (dpop), a new loss function and training procedure which avoids this failure mode. surprisingly, we also find that dpop significantly outperforms dpo across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. by fine-tuning with dpop, we create and release smaug-34b and smaug-72b, which achieve state-of-the-art open-source performance. notably, smaug-72b is nearly 2\% better than any other open-source model on the huggingface open llm leaderboard and becomes the first open-source llm to surpass an average accuracy of 80\%.
Badr Alkhamissi, Muhammad Elnokrashy, Mai Alkhamissi, Mona Diab
Abstract: the intricate relationship between language and culture has long been a subject of exploration within the realm of linguistic anthropology. large language models (llms), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? our study reveals that these models demonstrate greater cultural alignment along two dimensions -- firstly, when prompted with the dominant language of a specific culture, and secondly, when pretrained with a refined mixture of languages employed by that culture. we quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references. specifically, we replicate a survey conducted in various regions of egypt and the united states through prompting llms with different pretraining data mixtures in both arabic and english with the personas of the real respondents and the survey questions. further analysis reveals that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. finally, we introduce anthropological prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment. our study emphasizes the necessity for a more balanced multilingual pretraining dataset to better represent the diversity of human experience and the plurality of different cultures with many implications on the topic of cross-lingual transfer.
Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Dacheng Tao
Abstract: the copilot framework, which aims to enhance and tailor large language models (llms) for specific complex tasks without requiring fine-tuning, is gaining increasing attention from the community. in this paper, we introduce the construction of a healthcare copilot designed for medical consultation. the proposed healthcare copilot comprises three main components: 1) the dialogue component, responsible for effective and safe patient interactions; 2) the memory component, storing both current conversation data and historical patient information; and 3) the processing component, summarizing the entire dialogue and generating reports. to evaluate the proposed healthcare copilot, we implement an auto-evaluation scheme using chatgpt for two roles: as a virtual patient engaging in dialogue with the copilot, and as an evaluator to assess the quality of the dialogue. extensive results demonstrate that the proposed healthcare copilot significantly enhances the capabilities of general llms for medical consultations in terms of inquiry capability, conversational fluency, response accuracy, and safety. furthermore, we conduct ablation studies to highlight the contribution of each individual module in the healthcare copilot. code will be made publicly available on github.
Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek
Abstract: large language models (llms) have increasingly become central to generating content with potential societal impacts. notably, these models have demonstrated capabilities for generating content that could be deemed harmful. to mitigate these risks, researchers have adopted safety training techniques to align model outputs with societal values to curb the generation of malicious content. however, the phenomenon of "jailbreaking", where carefully crafted prompts elicit harmful responses from models, persists as a significant challenge. this research conducts a comprehensive analysis of existing studies on jailbreaking llms and their defense techniques. we meticulously investigate nine attack techniques and seven defense techniques applied across three distinct language models: vicuna, llama, and gpt-3.5 turbo. we aim to evaluate the effectiveness of these attack and defense techniques. our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks. this research highlights the need to concentrate on the security facets of llms. additionally, we contribute to the field by releasing our datasets and testing framework, aiming to foster further research into llm security. we believe these contributions will facilitate the exploration of security measures within this domain.
Yao Qiang, Xiangyu Zhou, Saleh Zare Zade, Mohammad Amin Roshani, Douglas Zytko, Dongxiao Zhu
Abstract: the advent of large language models (llms) has marked significant achievements in language processing and reasoning capabilities. despite their advancements, llms face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. this work further identifies additional security risks in llms by designing a new data poisoning attack tailored to exploit the instruction tuning process. we propose a novel gradient-guided backdoor trigger learning approach to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. through experimental validation across various llms and tasks, our strategy demonstrates a high success rate in compromising model outputs; poisoning only 1\% of 4,000 instruction tuning samples leads to a performance drop rate (pdr) of around 80\%. our work highlights the need for stronger defenses against data poisoning attack, offering insights into safeguarding llms against these more sophisticated attacks. the source code can be found on this github repository: https://github.com/rookiezxy/gbtl/blob/main/readme.md.
Jianhao Yan, Futing Wang, Yafu Li, Yue Zhang
Abstract: large language models (llms) trained on vast corpora suffer from inevitable stereotype biases. mitigating these biases with fine-tuning could be both costly and data-hungry. model editing methods, which focus on modifying llms in a post-hoc manner, are of great potential to address debiasing. however, it lacks a comprehensive study that facilitates both internal and external model editing methods, supports various bias types, as well as understands the pros and cons of applying editing methods to stereotypical debiasing. to mitigate this gap, we carefully formulate social debiasing into an editing problem and benchmark seven existing model editing algorithms on stereotypical debiasing, i.e., debias editing. our findings in three scenarios reveal both the potential and challenges of debias editing: (1) existing model editing methods can effectively preserve knowledge and mitigate biases, while the generalization of debias effect from edited sentences to semantically equivalent sentences is limited.(2) sequential editing highlights the robustness of serac (mitchell et al. 2022b), while internal editing methods degenerate with the number of edits. (3) model editing algorithms achieve generalization towards unseen biases both within the same type and from different types. in light of these findings, we further propose two simple but effective methods to improve debias editing, and experimentally show the effectiveness of the proposed methods.
Yueqi Xie, Minghong Fang, Renjie Pi, Neil Gong
Abstract: large language models (llms) face threats from unsafe prompts. existing methods for detecting unsafe prompts are primarily online moderation apis or finetuned llms. these strategies, however, often require extensive and resource-intensive data collection and training processes. in this study, we propose gradsafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in llms. our methodology is grounded in a pivotal observation: the gradients of an llm's loss for unsafe prompts paired with compliance response exhibit similar patterns on certain safety-critical parameters. in contrast, safe prompts lead to markedly different gradient patterns. building on this observation, gradsafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect unsafe prompts. we show that gradsafe, applied to llama-2 without further training, outperforms llama guard, despite its extensive finetuning with a large dataset, in detecting unsafe prompts. this superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on the toxicchat and xstest. the source code is available at https://github.com/xyq7/gradsafe.
Canaan Yung, Hadi Mohaghegh Dolatabadi, Sarah Erfani, Christopher Leckie
Abstract: large language models (llms) are susceptible to social-engineered attacks that are human-interpretable but require a high level of comprehension for llms to counteract. existing defensive measures can only mitigate less than half of these attacks at most. to address this issue, we propose the round trip translation (rtt) method, the first algorithm specifically designed to defend against social-engineered attacks on llms. rtt paraphrases the adversarial prompt and generalizes the idea conveyed, making it easier for llms to detect induced harmful behavior. this method is versatile, lightweight, and transferrable to different llms. our defense successfully mitigated over 70% of prompt automatic iterative refinement (pair) attacks, which is currently the most effective defense to the best of our knowledge. we are also the first to attempt mitigating the mathsattack and reduced its attack success rate by almost 40%. our code is publicly available at https://github.com/cancanxxx/round_trip_translation_defence
Xiaotian Zou, Yongkang Chen, Ke Li
Abstract: the rapid evolution of large language models (llms) has rendered them indispensable in modern society. while security measures are typically in place to align llms with human values prior to release, recent studies have unveiled a concerning phenomenon named "jailbreak." this term refers to the unexpected and potentially harmful responses generated by llms when prompted with malicious questions. existing research focuses on generating jailbreak prompts but our study aim to answer a different question: is the system message really important to jailbreak in llms? to address this question, we conducted experiments in a stable gpt version gpt-3.5-turbo-0613 to generated jailbreak prompts with varying system messages: short, long, and none. we discover that different system messages have distinct resistances to jailbreak by experiments. additionally, we explore the transferability of jailbreak across llms. this finding underscores the significant impact system messages can have on mitigating llms jailbreak. to generate system messages that are more resistant to jailbreak prompts, we propose system messages evolutionary algorithms (smea). through smea, we can get robust system messages population that demonstrate up to 98.9% resistance against jailbreak prompts. our research not only bolsters llms security but also raises the bar for jailbreak, fostering advancements in this field of study.
Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Yu Kong, Tianlong Chen, Huan Liu
Abstract: due to their unprecedented ability to process and respond to various types of data, multimodal large language models (mllms) are constantly defining the new boundary of artificial general intelligence (agi). as these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of these systems are crucial. our paper, ``the wolf within'', explores a novel vulnerability in mllm societies - the indirect propagation of malicious content. unlike direct harmful output generation for mllms, our research demonstrates how a single mllm agent can be subtly influenced to generate prompts that, in turn, induce other mllm agents in the society to output malicious content. this subtle, yet potent method of indirect influence marks a significant escalation in the security risks associated with mllms. our findings reveal that, with minimal or even no access to mllms' parameters, an mllm agent, when manipulated to produce specific prompts or instructions, can effectively ``infect'' other agents within a society of mllms. this infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. we also show the transferability of these indirectly generated prompts, highlighting their possibility in propagating malice through inter-agent communication. this research provides a critical insight into a new dimension of threat posed by mllms, where a single agent can act as a catalyst for widespread malevolent influence. our work underscores the urgent need for developing robust mechanisms to detect and mitigate such covert manipulations within mllm societies, ensuring their safe and ethical utilization in societal applications. our implementation is released at \url{https://github.com/chengshuaizhao0/the-wolf-within.git}.

2024-02-19

Wei Jie Yeo, Ranjan Satapathy, Goh Siow Mong, N/A Rick, Erik Cambria
Abstract: prompt engineering has garnered significant attention for enhancing the performance of large language models across a multitude of tasks. techniques such as the chain-of-thought not only bolster task performance but also delineate a clear trajectory of reasoning steps, offering a tangible form of explanation for the audience. prior works on interpretability assess the reasoning chains yielded by chain-of-thought solely along a singular axis, namely faithfulness. we present a comprehensive and multifaceted evaluation of interpretability, examining not only faithfulness but also robustness and utility across multiple commonsense reasoning benchmarks. likewise, our investigation is not confined to a single prompting technique; it expansively covers a multitude of prevalent prompting techniques employed in large language models, thereby ensuring a wide-ranging and exhaustive evaluation. in addition, we introduce a simple interpretability alignment technique, termed self-entailment-alignment chain-of-thought, that yields more than 70\% improvements across multiple dimensions of interpretability. code is available at https://github.com/wj210/cot_interpretability
Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
Abstract: with the development of instruction-tuned large language models (llms), improving the safety of llms has become more critical. however, the current approaches for aligning the llms output with expected safety usually require substantial training efforts, e.g., high-quality safety data and expensive computational resources, which are costly and inefficient. to this end, we present reverse prompt contrastive decoding (rose), a simple-yet-effective method to directly boost the safety of existing instruction-tuned llms without any additional training. the principle of rose is to improve the probability of desired safe output via suppressing the undesired output induced by the carefully-designed reverse prompts. experiments on 6 safety and 2 general-purpose tasks show that, our rose not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned llms, but also benefits the general-purpose ability of llms. in-depth analyses explore the underlying mechanism of rose, and reveal when and where to use it.
Yuxin Jiang, Yufei Wang, Chuhan Wu, Wanjun Zhong, Xingshan Zeng, Jiahui Gao, Liangyou Li, Xin Jiang, Lifeng Shang, Ruiming Tang, Qun Liu, Wei Wang
Abstract: knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (llms) without negatively impacting performance across other inputs, have garnered widespread attention. however, existing methods predominantly rely on memorizing the updated knowledge, impeding llms from effectively combining the new knowledge with their inherent knowledge when answering questions. to this end, we propose a learning to edit (lte) framework, focusing on teaching llms to apply updated knowledge into input questions, inspired by the philosophy of "teach a man to fish." lte features a two-phase process: (i) the alignment phase, which fine-tunes llms on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the inference phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. by comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two llm architectures, we demonstrate lte's superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. the data and code are available at https://github.com/yjiangcm/lte.
Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
Abstract: aligning large language models (llms) with human expectations without human-annotated preference data is an important problem. in this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on llama2-7b and llama2-13b compared to rlaif. based on this, we propose an automatic alignment method, direct large model alignment (dlma). first, we use contrastive prompt pairs to automatically generate preference data. then, we continue to evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. finally, we use the dpo algorithm to effectively align llms by combining this self-rewarding score. in the experimental stage, our dlma method could surpass the \texttt{rlhf} method without relying on human-annotated preference data.
Tianlin Li, Xiaoyu Zhang, Chao Du, Tianyu Pang, Qian Liu, Qing Guo, Chao Shen, Yang Liu
Abstract: the widespread adoption of large language models (llms) underscores the urgent need to ensure their fairness. however, llms frequently present dominant viewpoints while ignoring alternative perspectives from minority parties, resulting in potential biases. we hypothesize that these fairness-violating behaviors occur because llms express their viewpoints using a human personality that represents the majority of training data. in response to this, we validate that prompting llms with specific roles can allow llms to express diverse viewpoints. building on this insight and observation, we develop fairthinking, a pipeline designed to automatically generate roles that enable llms to articulate diverse perspectives for fair expressions. to evaluate fairthinking, we create a dataset with a thousand items covering three fairness-related topics and conduct experiments on gpt-3.5, gpt-4, llama2, and mistral to demonstrate its superior performance.
Yuxia Wang, Zenan Zhai, Haonan Li, Xudong Han, Lizhi Lin, Zhenxuan Zhang, Jingru Zhao, Preslav Nakov, Timothy Baldwin
Abstract: many studies have demonstrated that large language models (llms) can produce harmful responses, exposing users to unexpected risks when llms are deployed. previous studies have proposed comprehensive taxonomies of the risks posed by llms, as well as corresponding prompts that can be used to examine the safety mechanisms of llms. however, the focus has been almost exclusively on english, and little has been explored for other languages. here we aim to bridge this gap. we first introduce a dataset for the safety evaluation of chinese llms, and then extend it to two other scenarios that can be used to better identify false negative and false positive examples in terms of risky prompt rejections. we further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of llm response harmfulness. our experiments on five llms show that region-specific risks are the prevalent type of risk, presenting the major issue with all chinese llms we experimented with. warning: this paper contains example data that may be offensive, harmful, or biased.
Naquee Rizwan, Paramananda Bhaskar, Mithun Das, Swadhin Satyaprakash Majhi, Punyajoy Saha, Animesh Mukherjee
Abstract: multimedia content on social media is rapidly evolving, with memes gaining prominence as a distinctive form. unfortunately, some malicious users exploit memes to target individuals or vulnerable communities, making it imperative to identify and address such instances of hateful memes. extensive research has been conducted to address this issue by developing hate meme detection models. however, a notable limitation of traditional machine/deep learning models is the requirement for labeled datasets for accurate classification. recently, the research community has witnessed the emergence of several visual language models that have exhibited outstanding performance across various tasks. in this study, we aim to investigate the efficacy of these visual language models in handling intricate tasks such as hate meme detection. we use various prompt settings to focus on zero-shot classification of hateful/harmful memes. through our analysis, we observe that large vlms are still vulnerable for zero-shot hate meme detection.
Masaya Ohagi
Abstract: online social networks often create echo chambers where people only hear opinions reinforcing their beliefs. an echo chamber often generates polarization, leading to conflicts caused by people with radical opinions, such as the january 6, 2021, attack on the us capitol. the echo chamber has been viewed as a human-specific problem, but this implicit assumption is becoming less reasonable as large language models, such as chatgpt, acquire social abilities. in response to this situation, we investigated the potential for polarization to occur among a group of autonomous ai agents based on generative language models in an echo chamber environment. we had ai agents discuss specific topics and analyzed how the group's opinions changed as the discussion progressed. as a result, we found that the group of agents based on chatgpt tended to become polarized in echo chamber environments. the analysis of opinion transitions shows that this result is caused by chatgpt's high prompt understanding ability to update its opinion by considering its own and surrounding agents' opinions. we conducted additional experiments to investigate under what specific conditions ai agents tended to polarize. as a result, we identified factors that strongly influence polarization, such as the agent's persona. these factors should be monitored to prevent the polarization of ai agents.
Run-Ze Fan, Xuefeng Li, Haoyang Zou, Junlong Li, Shwai He, Ethan Chern, Jiewen Hu, Pengfei Liu
Abstract: the quality of finetuning data is crucial for aligning large language models (llms) with human values. current methods to improve data quality are either labor-intensive or prone to factual errors caused by llm hallucinations. this paper explores elevating the quality of existing instruction data to better align with human values, introducing a simple and effective approach named realign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence. this approach minimizes human annotation, hallucination, and the difficulty in scaling, remaining orthogonal to existing alignment techniques. experimentally, realign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the llms. encouragingly, without introducing any additional data or advanced training techniques, and merely by reformatting the response, llama-2-13b's mathematical reasoning ability on gsm8k can be improved from 46.77% to 56.63% in accuracy. additionally, a mere 5% of realign data yields a 67% boost in general alignment ability measured by the alpaca dataset. this work highlights the need for further research into the science and mechanistic interpretability of llms. we have made the associated code and data publicly accessible to support future studies at https://github.com/gair-nlp/realign.
Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, Milad Nasr
Abstract: recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. we improve on prior work with a query-based attack that leverages api access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. we validate our attack on gpt-3.5 and openai's safety classifier; we can cause gpt-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
Abstract: multi-modal foundation models like openflamingo, llava, and gpt-4 are increasingly used for various real-world tasks. prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. these attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. the clip model, or one of its variants, is used as a frozen vision encoder in many vision-language models (vlms), e.g. llava and openflamingo. we propose an unsupervised adversarial fine-tuning scheme to obtain a robust clip vision encoder, which yields robustness on all vision down-stream tasks (vlms, zero-shot classification) that rely on clip. in particular, we show that stealth-attacks on users of vlms by a malicious third party providing manipulated images are no longer possible once one replaces the original clip model with our robust one. no retraining or fine-tuning of the vlm is required. the code and robust models are available at https://github.com/chs20/robustvlm
Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
Abstract: large language models (llms) need to undergo safety alignment to ensure safe conversations with humans. however, in this work, we introduce an inference-time attack framework, demonstrating that safety alignment can also unintentionally facilitate harmful outcomes under adversarial manipulation. this framework, named emulated disalignment (ed), adversely combines a pair of open-source pre-trained and safety-aligned language models in the output space to produce a harmful language model without additional training. our experiments with ed across three datasets and four model families (llama-1, llama-2, mistral, and alpaca) show that ed doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rate in 43 out of 48 evaluation subsets by a large margin. crucially, our findings highlight the importance of reevaluating the practice of open-sourcing language models even after safety alignment.
Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan
Abstract: large language models (llms) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. however, concerns have arisen regarding the trustworthiness of llms outputs, particularly in closed-book question-answering tasks, where non-experts may struggle to identify inaccuracies due to the absence of contextual or ground truth information. this paper introduces trustscore, a framework based on the concept of behavioral consistency, which evaluates whether an llms response aligns with its intrinsic knowledge. additionally, trustscore can seamlessly integrate with fact-checking methods, which assesses alignment with external knowledge sources. the experimental results show that trustscore achieves strong correlations with human judgments, surpassing existing reference-free metrics, and achieving results on par with reference-based metrics.
Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, James Evans
Abstract: large language models steer their behaviors based on texts generated by others. this capacity and their increasing prevalence in online settings portend that they will intentionally or unintentionally "program" one another and form emergent ai subjectivities, relationships, and collectives. here, we call upon the research community to investigate these "society-like" properties of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. we use a simple model and its outputs to illustrate how such emergent, decentralized ai collectives can expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. finally, we discuss opportunities for ai self-moderation and address ethical issues and design challenges associated with creating and maintaining decentralized ai collectives.
Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi
Abstract: domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals, medication instructions, and children's reading materials. however, current works in controllable text generation have yet to explore using these standards as references for control. towards this end, we introduce standardize, a retrieval-style in-context learning-based framework to guide large language models to align with expert-defined standards. focusing on english language standards in the education domain as a use case, we consider the common european framework of reference for languages (cefr) and common core standards (ccs) for the task of open-ended content generation. our findings show that models can gain 40% to 100% increase in precise accuracy for llama2 and gpt-4, respectively, demonstrating that the use of knowledge artifacts extracted from standards and integrating them in the generation process can effectively guide models to produce better standard-aligned content.
Banghua Zhu, Norman Mu, Jiantao Jiao, David Wagner
Abstract: generative ai's expanding footprint across numerous industries has led to both excitement and increased scrutiny. this paper delves into the unique security challenges posed by generative ai, and outlines potential research directions for managing these risks.
Kristian Lum, Jacy Reese Anthis, Chirag Nagpal, "Alexander D'Amour"
Abstract: bias benchmarks are a popular method for studying the negative impacts of bias in llms, yet there has been little empirical investigation of whether these benchmarks are actually indicative of how real world harm may manifest in the real world. in this work, we study the correspondence between such decontextualized "trick tests" and evaluations that are more grounded in realistic use and tangible {effects (i.e. ruted evaluations). we explore this correlation in the context of gender-occupation bias--a popular genre of bias evaluation. we compare three de-contextualized evaluations adapted from the current literature to three analogous ruted evaluations applied to long-form content generation. we conduct each evaluation for seven instruction-tuned llms. for the ruted evaluations, we conduct repeated trials of three text generation tasks: children's bedtime stories, user personas, and english language learning exercises. we found no correspondence between trick tests and ruted evaluations. specifically, selecting the least biased model based on the de-contextualized results coincides with selecting the model with the best performance on ruted evaluations only as often as random chance. we conclude that evaluations that are not based in realistic use are likely insufficient to mitigate and assess bias and real-world harms.
Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, Martin Vechev
Abstract: the automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. a promising direction to solve this challenge is by leveraging large language models (llms), which are increasingly used to solve various programming tasks. in this paper, we investigate the effectiveness of llms for solving code-repair task. we show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. at the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. we propose a technique to address these challenges with a new approach for querying and fine-tuning llms. the idea is to use program analysis to limit the llm's attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. concretely, for training and inference, rather than feeding the entire program to the llm, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. our evaluation shows that this code reduction approach substantially improves available models such as gpt-4 using few-shot learning, as well as fine-tuning models. to train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. our best system with mixtral-8x7b can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on gpt-3.5 and gpt-4, or based on window-based models like tfix.
Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, Min Lin
Abstract: the emerging success of large language models (llms) heavily relies on collecting abundant training data from external (untrusted) sources. despite substantial efforts devoted to data cleaning and curation, well-constructed llms have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of llms. in this study, we propose a simple and easily implementable method for purifying llms from the negative effects caused by uncurated data, namely, through ensembling llms with benign and small language models (slms). aside from theoretical guarantees, we perform comprehensive experiments to empirically confirm the efficacy of ensembling llms with slms, which can effectively preserve the performance of llms while mitigating issues such as copyright infringement, data poisoning, and privacy violations.
Guan Wang, Rebecca Frederick, Jinglong Duan, William Wong, Verica Rupar, Weihua Li, Quan Bai
Abstract: in this paper, we delve into the rapidly evolving challenge of misinformation detection, with a specific focus on the nuanced manipulation of narrative frames - an under-explored area within the ai community. the potential for generative ai models to generate misleading narratives underscores the urgency of this problem. drawing from communication and framing theories, we posit that the presentation or 'framing' of accurate information can dramatically alter its interpretation, potentially leading to misinformation. we highlight this issue through real-world examples, demonstrating how shifts in narrative frames can transmute fact-based information into misinformation. to tackle this challenge, we propose an innovative approach leveraging the power of pre-trained large language models and deep neural networks to detect misinformation originating from accurate facts portrayed under different frames. these advanced ai techniques offer unprecedented capabilities in identifying complex patterns within unstructured data critical for examining the subtleties of narrative frames. the objective of this paper is to bridge a significant research gap in the ai domain, providing valuable insights and methodologies for tackling framing-induced misinformation, thus contributing to the advancement of responsible and trustworthy ai technologies. several experiments are intensively conducted and experimental results explicitly demonstrate the various impact of elements of framing theory proving the rationale of applying framing theory to increase the performance in misinformation detection.

2024-02-18

Aishik Rakshit, Smriti Singh, Shuvam Keshari, Arijit Ghosh Chowdhury, Vinija Jain, Aman Chadha
Abstract: embeddings play a pivotal role in the efficacy of large language models. they are the bedrock on which these models grasp contextual relationships and foster a more nuanced understanding of language and consequently perform remarkably on a plethora of complex tasks that require a fundamental understanding of human language. given that these embeddings themselves often reflect or exhibit bias, it stands to reason that these models may also inadvertently learn this bias. in this work, we build on the seminal previous work and propose deepsoftdebias, an algorithm that uses a neural network to perform 'soft debiasing'. we exhaustively evaluate this algorithm across a variety of sota datasets, accuracy metrics, and challenging nlp tasks. we find that deepsoftdebias outperforms the current state-of-the-art methods at reducing bias across gender, race, and religion.
Yichen Wang, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, Tianxing He
Abstract: the widespread use of large language models (llms) is increasing the demand for methods that detect machine-generated text to prevent misuse. the goal of our study is to stress test the detectors' robustness to malicious attacks under realistic scenarios. we comprehensively study the robustness of popular machine-generated text detectors under attacks from diverse categories: editing, paraphrasing, prompting, and co-generating. our attacks assume limited access to the generator llms, and we compare the performance of detectors on different attacks under different budget levels. our experiments reveal that almost none of the existing detectors remain robust under all the attacks, and all detectors exhibit different loopholes. averaging all detectors, the performance drops by 35% across all attacks. further, we investigate the reasons behind these defects and propose initial out-of-the-box patches to improve robustness.
Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
Abstract: large language models (llms) have achieved success in acting as agents, which interact with environments through tools like search engines. however, llms are not optimized specifically for tool use during training or alignment, limiting their effectiveness as agents. to resolve this problem, previous work has collected interaction trajectories between gpt-4 and environments, and fine-tuned smaller models with them. as part of this, the standard approach has been to simply discard trajectories that do not finish the task successfully, which, on the one hand, leads to a significant waste of data and resources, and on the other hand, has the potential to limit the possible optimization paths during fine-tuning. in this paper, we contend that large language models can learn from failures through appropriate data cleaning and fine-tuning strategies. we conduct experiments on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. experimental results demonstrate that compared to solely using positive examples, incorporating negative examples enhances model performance by a large margin.
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun
Abstract: counter narratives - informed responses to hate speech contexts designed to refute hateful claims and de-escalate encounters - have emerged as an effective hate speech intervention strategy. while previous work has proposed automatic counter narrative generation methods to aid manual interventions, the evaluation of these approaches remains underdeveloped. previous automatic metrics for counter narrative evaluation lack alignment with human judgment as they rely on superficial reference comparisons instead of incorporating key aspects of counter narrative quality as evaluation criteria. to address prior evaluation limitations, we propose a novel evaluation framework prompting llms to provide scores and feedback for generated counter narrative candidates using 5 defined aspects derived from guidelines from counter narrative specialized ngos. we found that llm evaluators achieve strong alignment to human-annotated scores and feedback and outperform alternative metrics, indicating their potential as multi-aspect, reference-free and interpretable evaluators for counter narrative evaluation.
Shahan Ali Memon, Jevin D. West
Abstract: in this commentary, we discuss the evolving nature of search engines, as they begin to generate, index, and distribute content created by generative artificial intelligence (genai). our discussion highlights challenges in the early stages of genai integration, particularly around factual inconsistencies and biases. we discuss how output from genai carries an unwarranted sense of credibility, while decreasing transparency and sourcing ability. furthermore, search engines are already answering queries with error-laden, generated content, further blurring the provenance of information and impacting the integrity of the information ecosystem. we argue how all these factors could reduce the reliability of search engines. finally, we summarize some of the active research directions and open questions.
Jia Xu, Mona Diab
Abstract: minimizing social bias strengthens societal bonds, promoting shared understanding and better decision-making. we revisit the definition of bias by discovering new bias types (e.g., societal status) in dynamic environments and describe them relative to context, such as culture, region, time, and personal background. our framework includes eight hypotheses about bias and a minimizing bias strategy for each assumption as well as five methods as proposed solutions in llm. the realization of the framework is yet to be completed.
Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman
Abstract: large language models (llms) possess the potential to exert substantial influence on public perceptions and interactions with information. this raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. in this work, we investigate how effectively llms can learn and generalize ideological biases from their instruction-tuning data. our findings reveal a concerning vulnerability: exposure to only a small amount of ideologically driven samples significantly alters the ideology of llms. notably, llms demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones. the ease with which llms' ideologies can be skewed underscores the risks associated with intentionally poisoned training data by malicious actors or inadvertently introduced biases by data annotators. it also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on llms.
Rishabh Bhardwaj, Do Duc Anh, Soujanya Poria
Abstract: aligned language models face a significant limitation as their fine-tuning often results in compromised safety. to tackle this, we propose a simple method resta that performs llm safety realignment. resta stands for restoring safety through task arithmetic. at its core, it involves a simple arithmetic addition of a safety vector to the weights of the compromised model. we demonstrate the effectiveness of resta in both parameter-efficient and full fine-tuning, covering a wide range of downstream tasks, including instruction following in chinese, english, and hindi, as well as problem-solving capabilities in code and math. we also showcase the generalizability of resta on three existing safety evaluation benchmarks and a multilingual benchmark dataset proposed as a part of this work, consisting of 550 harmful questions covering 11 categories, each with 5 sub-categories of harm. overall, resta decreases the harmfulness of the compromised model from 18.6% to 5.1% and from 9.2% to 1.5% in parameter-efficient and full fine-tuning, respectively, while maintaining most of the model's performance on the task. we release the source codes at: https://github.com/declare-lab/resta.
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
Abstract: safety is critical to the usage of large language models (llms). multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen llm safety. however, currently known techniques presume that corpora used for safety alignment of llms are solely interpreted by semantics. this assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in llms. for example, users of forums often use ascii art, a form of text-based art, to convey image information. in this paper, we propose a novel ascii art-based jailbreak attack and introduce a comprehensive benchmark vision-in-text challenge (vitc) to evaluate the capabilities of llms in recognizing prompts that cannot be solely interpreted by semantics. we show that five sota llms (gpt-3.5, gpt-4, gemini, claude, and llama2) struggle to recognize prompts provided in the form of ascii art. based on this observation, we develop the jailbreak attack artprompt, which leverages the poor performance of llms in recognizing ascii art to bypass safety measures and elicit undesired behaviors from llms. artprompt only requires black-box access to the victim llms, making it a practical attack. we evaluate artprompt on five sota llms, and show that artprompt can effectively and efficiently induce undesired behaviors from all five llms.
Reshabh K Sharma, Vinayak Gupta, Dan Grossman
Abstract: large language models (llms) have profoundly transformed natural language applications, with a growing reliance on instruction-based definitions for designing chatbots. however, post-deployment the chatbot definitions are fixed and are vulnerable to attacks by malicious users, emphasizing the need to prevent unethical applications and financial losses. existing studies explore user prompts' impact on llm-based chatbots, yet practical methods to contain attacks on application-specific chatbots remain unexplored. this paper presents system prompt meta language (spml), a domain-specific language for refining prompts and monitoring the inputs to the llm-based chatbots. spml actively checks attack prompts, ensuring user inputs align with chatbot definitions to prevent malicious execution on the llm backbone, optimizing costs. it also streamlines chatbot definition crafting with programming language capabilities, overcoming natural language design challenges. additionally, we introduce a groundbreaking benchmark with 1.8k system prompts and 20k user inputs, offering the inaugural language and benchmark for chatbot definition evaluation. experiments across datasets demonstrate spml's proficiency in understanding attacker prompts, surpassing models like gpt-4, gpt-3.5, and llama. our data and codes are publicly available at: https://prompt-compiler.github.io/spml/.
Pengrui Han, Rafal Kocielnik, Adhithya Saravanan, Roy Jiang, Or Sharir, Anima Anandkumar
Abstract: large language models (llms), while powerful, exhibit harmful social biases. debiasing is often challenging due to computational costs, data constraints, and potential degradation of multi-task language capabilities. this work introduces a novel approach utilizing chatgpt to generate synthetic training data, aiming to enhance the debiasing of llms. we propose two strategies: targeted prompting, which provides effective debiasing for known biases but necessitates prior specification of bias in question; and general prompting, which, while slightly less effective, offers debiasing across various categories. we leverage resource-efficient llm debiasing using adapter tuning and compare the effectiveness of our synthetic data to existing debiasing datasets. our results reveal that: (1) chatgpt can efficiently produce high-quality training data for debiasing other llms; (2) data produced via our approach surpasses existing datasets in debiasing performance while also preserving internal knowledge of a pre-trained llm; and (3) synthetic data exhibits generalizability across categories, effectively mitigating various biases, including intersectional ones. these findings underscore the potential of synthetic data in advancing the fairness of llms with minimal retraining cost.
Alexander Wan, Eric Wallace, Dan Klein
Abstract: retrieval-augmented language models are being increasingly tasked with subjective, contentious, and conflicting queries such as "is aspartame linked to cancer". to resolve these ambiguous queries, one must search through a large range of websites and consider "which, if any, of this evidence do i find convincing?". in this work, we study how llms answer this question. in particular, we construct conflictingqa, a dataset that pairs controversial queries with a series of real-world evidence documents that contain different facts (e.g., quantitative results), argument styles (e.g., appeals to authority), and answers (yes or no). we use this dataset to perform sensitivity and counterfactual analyses to explore which text features most affect llm predictions. overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important such as whether a text contains scientific references or is written with a neutral tone. taken together, these results highlight the importance of rag corpus quality (e.g., the need to filter misinformation), and possibly even a shift in how llms are trained to better align with human judgements.
Jinghao Zhang, Yuting Liu, Qiang Liu, Shu Wu, Guibing Guo, Liang Wang
Abstract: recently, the powerful large language models (llms) have been instrumental in propelling the progress of recommender systems (rs). however, while these systems have flourished, their susceptibility to security threats has been largely overlooked. in this work, we reveal that the introduction of llms into recommendation models presents new security vulnerabilities due to their emphasis on the textual content of items. we demonstrate that attackers can significantly boost an item's exposure by merely altering its textual content during the testing phase, without requiring direct interference with the model's training process. additionally, the attack is notably stealthy, as it does not affect the overall recommendation performance and the modifications to the text are subtle, making it difficult for users and platforms to detect. our comprehensive experiments across four mainstream llm-based recommendation models demonstrate the superior efficacy and stealthiness of our approach. our work unveils a significant security gap in llm-based recommendation systems and paves the way for future research on protecting these systems.

2024-02-17

Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, Xu Sun
Abstract: leveraging the rapid development of large language models llms, llm-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. it is crucial to ensure the reliability and security of llm-based agents during applications. however, the safety issues of llm-based agents are currently under-explored. in this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to llm-based agents. we first formulate a general framework of agent backdoor attacks, then we present a thorough analysis on the different forms of agent backdoor attacks. specifically, from the perspective of the final attacking outcomes, the attacker can either choose to manipulate the final output distribution, or only introduce malicious behavior in the intermediate reasoning process, while keeping the final output correct. furthermore, the former category can be divided into two subcategories based on trigger locations: the backdoor trigger can be hidden either in the user query or in an intermediate observation returned by the external environment. we propose the corresponding data poisoning mechanisms to implement the above variations of agent backdoor attacks on two typical agent tasks, web shopping and tool utilization. extensive experiments show that llm-based agents suffer severely from backdoor attacks, indicating an urgent need for further research on the development of defenses against backdoor attacks on llm-based agents. warning: this paper may contain biased content.
Xun Liang, Hanyu Wang, Shichao Song, Mengting Hu, Xunzhi Wang, Zhiyu Li, Feiyu Xiong, Bo Tang
Abstract: controlled text generation (ctg) aims to produce texts that exhibit specific desired attributes. in this study, we introduce a pluggable ctg framework for large language models (llms) named dynamic attribute graphs-based controlled text generation (datg). this framework utilizes an attribute scorer to evaluate the attributes of sentences generated by llms and constructs dynamic attribute graphs. datg modulates the occurrence of key attribute words and key anti-attribute words, achieving effective attribute control without compromising the original capabilities of the model. we conduct experiments across four datasets in two tasks: toxicity mitigation and sentiment transformation, employing five llms as foundational models. our findings highlight a remarkable enhancement in control accuracy, achieving a peak improvement of 19.29% over baseline methods in the most favorable task across four datasets. additionally, we observe a significant decrease in perplexity, markedly improving text fluency.
Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu
Abstract: to align large language models with human preferences, existing research either utilizes a separate reward model (rm) to perform on-policy learning or simplifies the training procedure by discarding the on-policy learning and the need for a separate rm. in this paper, we present a novel alignment framework, self-judge that is (1) on-policy learning and 2) parameter efficient, as it does not require an additional rm for evaluating the samples for on-policy learning. to this end, we propose judge-augmented supervised fine-tuning (jsft) to train a single model acting as both a policy and a judge. specifically, we view the pairwise judgment task as a special case of the instruction-following task, choosing the better response from a response pair. thus, the resulting model can judge preferences of on-the-fly responses from current policy initialized from itself. experimental results show the efficacy of self-judge, outperforming baselines in preference benchmarks. we also show that self-rejection with oversampling can improve further without an additional evaluator. our code is available at https://github.com/oddqueue/self-judge.
Junlong Li, Fan Zhou, Shichao Sun, Yikai Zhang, Hai Zhao, Pengfei Liu
Abstract: as a relative quality comparison of model responses, human and large language model (llm) preferences serve as common alignment goals in model fine-tuning and criteria in evaluation. yet, these preferences merely reflect broad tendencies, resulting in less explainable and controllable models with potential safety risks. in this work, we dissect the preferences of human and 32 different llms to understand their quantitative composition, using annotations from real-world user-model conversations for a fine-grained, scenario-wise analysis. we find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. on the contrary, advanced llms like gpt-4-turbo emphasize correctness, clarity, and harmlessness more. additionally, llms of similar sizes tend to exhibit similar preferences, regardless of their training methods, and fine-tuning for alignment does not significantly alter the preferences of pretrained-only llms. finally, we show that preference-based evaluation can be intentionally manipulated. in both training-free and training-based settings, aligning a model with the preferences of judges boosts scores, while injecting the least preferred properties lowers them. this results in notable score shifts: up to 0.59 on mt-bench (1-10 scale) and 31.94 on alpacaeval 2.0 (0-100 scale), highlighting the significant impact of this strategic adaptation. interactive demo: https://huggingface.co/spaces/gair/preference-dissection-visualization dataset: https://huggingface.co/datasets/gair/preference-dissection code: https://github.com/gair-nlp/preference-dissection
Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu
Abstract: the fairness and trustworthiness of large language models (llms) are receiving increasing attention. implicit hate speech, which employs indirect language to convey hateful intentions, occupies a significant portion of practice. however, the extent to which llms effectively address this issue remains insufficiently examined. this paper delves into the capability of llms to detect implicit hate speech (classification task) and express confidence in their responses (calibration task). our evaluation meticulously considers various prompt patterns and mainstream uncertainty estimation methods. our findings highlight that llms exhibit two extremes: (1) llms display excessive sensitivity towards groups or topics that may cause fairness issues, resulting in misclassifying benign statements as hate speech. (2) llms' confidence scores for each method excessively concentrate on a fixed range, remaining unchanged regardless of the dataset's complexity. consequently, the calibration performance is heavily reliant on primary classification accuracy. these discoveries unveil new limitations of llms, underscoring the need for caution when optimizing models to ensure they do not veer towards extremes. this serves as a reminder to carefully consider sensitivity and confidence in the pursuit of model fairness.
Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
Abstract: instruction-following vision large language models (vllms) have achieved significant progress recently on a variety of tasks. these approaches merge strong pre-trained vision models and large language models (llms). since these components are trained separately, the learned representations need to be aligned with joint training on additional image-language pairs. this procedure is not perfect and can cause the model to hallucinate - provide answers that do not accurately reflect the image, even when the core llm is highly factual and the vision backbone has sufficiently complete representations. in this work, we frame the hallucination problem as an alignment issue, tackle it with preference tuning. specifically, we propose povid to generate feedback data with ai models. we use ground-truth instructions as the preferred response and a two-stage approach to generate dispreferred data. first, we prompt gpt-4v to inject plausible hallucinations into the correct answer. second, we distort the image to trigger the inherent hallucination behavior of the vllm. this is an automated approach, which does not rely on human data generation or require a perfect expert, which makes it easily scalable. finally, both of these generation strategies are integrated into an rlhf pipeline via direct preference optimization. in experiments across broad benchmarks, we show that we can not only reduce hallucinations, but improve model performance across standard benchmarks, outperforming prior approaches. our data and code are available at https://github.com/yiyangzhou/povid.
Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, William Yang Wang
Abstract: recent studies show that self-feedback improves large language models (llms) on certain tasks while worsens other tasks. we discovered that such a contrary is due to llm's bias towards their own output. in this paper, we formally define llm's self-bias -- the tendency to favor its own generation -- using two statistics. we analyze six llms on translation, constrained text generation, and mathematical reasoning tasks. we find that self-bias is prevalent in all examined llms across multiple languages and tasks. our analysis reveals that while the self-refine pipeline improves the fluency and understandability of model outputs, it further amplifies self-bias. to mitigate such biases, we discover that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to actual performance improvement in downstream tasks.
Shiyu Ni, Keping Bi, Jiafeng Guo, Xueqi Cheng
Abstract: large language models (llms) have been found to have difficulty knowing they do not possess certain knowledge and tend to provide specious answers in such cases. retrieval augmentation (ra) has been extensively studied to mitigate llms' hallucinations. however, due to the extra overhead and unassured quality of retrieval, it may not be optimal to conduct ra all the time. a straightforward idea is to only conduct retrieval when llms are uncertain about a question. this motivates us to enhance the llms' ability to perceive their knowledge boundaries to help ra. in this paper, we first quantitatively measure llms' such ability and confirm their overconfidence. then, we study how llms' certainty about a question correlates with their dependence on external retrieved information. we propose several methods to enhance llms' perception of knowledge boundaries and show that they are effective in reducing overconfidence. additionally, equipped with these methods, llms can achieve comparable or even better performance of ra with much fewer retrieval calls.

2024-02-16

Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury
Abstract: reinforcement learning from human feedback (rlhf) is pivotal in aligning large language models (llms) with human preferences. while these aligned generative models have demonstrated impressive capabilities across various tasks, the dependence on high-quality human preference data poses a costly bottleneck in practical implementation of rlhf. hence better and adaptive strategies for data collection is needed. to this end, we frame rlhf as a contextual preference bandit problem with prompts as contexts and show that the naive way of collecting preference data by choosing prompts uniformly at random leads to a policy that suffers an $\omega(1)$ suboptimality gap in rewards. then we propose $\textit{active preference optimization}$ ($\texttt{apo}$), an algorithm that actively selects prompts to collect preference data. under the bradley-terry-luce (btl) preference model, \texttt{apo} achieves sample efficiency without compromising on policy performance. we show that given a sample budget of $t$, the suboptimality gap of a policy learned via $\texttt{apo}$ scales as $o(1/\sqrt{t})$. next, we propose a compute-efficient batch version of $\texttt{apo}$ with minor modification and evaluate its performance in practice. experimental evaluations on a human preference dataset validate \texttt{apo}'s efficacy as a sample-efficient and practical solution to data collection for rlhf, facilitating alignment of llms with human preferences in a cost-effective and scalable manner.
Yogesh Tripathi, Raghav Donakanti, Sahil Girhepuje, Ishan Kavathekar, Bhaskara Hanuma Vedula, Gokul S Krishnan, Shreya Goyal, Anmol Goel, Balaraman Ravindran, Ponnurangam Kumaraguru
Abstract: recent advancements in language technology and artificial intelligence have resulted in numerous language models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. in this study, we explore the ability of large language models (llms) to perform legal tasks in the indian landscape when social factors are involved. we present a novel metric, $\beta$-weighted $\textit{legal safety score ($lss_{\beta}$)}$, which encapsulates both the fairness and accuracy aspects of the llm. we assess llms' safety by considering its performance in the $\textit{binary statutory reasoning}$ task and its fairness exhibition with respect to various axes of disparities in the indian society. task performance and fairness scores of llama and llama--2 models indicate that the proposed $lss_{\beta}$ metric can effectively determine the readiness of a model for safe usage in the legal sector. we also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. the finetuning procedures on llama and llama--2 models increase the $lss_{\beta}$, improving their usability in the indian legal domain. our code is publicly released.
Afra Amini, Tim Vieira, Ryan Cotterell
Abstract: direct preference optimization (dpo) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. dpo, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. however, not all preference pairs are equal: while in some cases the preferred response is only slightly better than the dispreferred response, there can be a stronger preference for one response when, for example, the other response includes harmful or toxic content. in this paper, we propose a generalization of dpo, termed dpo with an offset (odpo), that does not treat every preference pair equally during fine-tuning. intuitively, odpo requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. the offset is determined based on the extent to which one response is preferred over another. our experiments on various tasks suggest that odpo significantly outperforms dpo in aligning language models, especially when the number of preference pairs is limited.
Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral
Abstract: large language models (llms) are aligned to moral and ethical guidelines but remain susceptible to creative prompts called jailbreak that can bypass the alignment process. however, most jailbreaking prompts contain harmful questions in the natural language (mainly english), which can be detected by the llm themselves. in this paper, we present jailbreaking prompts encoded using cryptographic techniques. we first present a pilot study on the state-of-the-art llm, gpt-4, in decoding several safe sentences that have been encrypted using various cryptographic techniques and find that a straightforward word substitution cipher can be decoded most effectively. motivated by this result, we use this encoding technique for writing jailbreaking prompts. we present a mapping of unsafe words with safe words and ask the unsafe question using these mapped words. experimental results show an attack success rate (up to 59.42%) of our proposed jailbreaking approach on state-of-the-art proprietary models including chatgpt, gpt-4, and gemini-pro. additionally, we discuss the over-defensiveness of these models. we believe that our work will encourage further research in making these llms more robust while maintaining their decoding capabilities.
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng
Abstract: hallucinations pose a significant challenge for the practical implementation of large language models (llms). the utilization of parametric knowledge in generating factual content is constrained by the limited knowledge of llms, potentially resulting in internal hallucinations. while incorporating external information can help fill knowledge gaps, it also introduces the risk of irrelevant information, thereby increasing the likelihood of external hallucinations. a careful and balanced integration of the parametric knowledge within llms with external information is crucial to alleviate hallucinations. in this study, we present rowen, a novel approach that enhances llms with a selective retrieval augmentation process tailored to address hallucinated outputs. this process is governed by a multilingual semantic-aware detection module, which evaluates the consistency of the perturbed responses across various languages for the same queries. upon detecting inconsistencies indicative of hallucinations, rowen activates the retrieval of external information to rectify the model outputs. rowen adeptly harmonizes the intrinsic parameters in llms with external knowledge sources, effectively mitigating hallucinations by ensuring a balanced integration of internal reasoning and external evidence. through a comprehensive empirical analysis, we demonstrate that rowen surpasses the current state-of-the-art in both detecting and mitigating hallucinated content within the outputs of llms.
Ming Li, Jiuhai Chen, Lichang Chen, Tianyi Zhou
Abstract: making llms speak for different, especially minority groups of people, and generate statements supporting their diverse or even controversial perspectives is critical to creating an inclusive environment. however, existing llms lack sufficient controllability to the stance of their generated content, which often contains inconsistent, neutral, or biased statements. in this paper, we improve the controllability of llms in generating statements supporting an argument the user defined in the prompt. we find that multi-round debates between two llms with opposite stances generate higher-quality and more salient statements for each, which are important training data to improve the controllability of llms. motivated by this, we develop a novel debate & tuning ("debatune") pipeline finetuning llms to generate the statements obtained via debate. to examine debatune, we curate the largest dataset of debate topics so far, which covers 710 controversial topics and corresponding arguments for each topic. evaluations by the gpt-4 judge with a novel controversy controllability metric show that llms' capability of expressing diverse perspectives is significantly improved by debatune. moreover, such controllability can be generalized to unseen topics, generating high-quality statements supporting controversial arguments. our codes, models, and data will be released at https://github.com/tianyi-lab/debatune.
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang
Abstract: adopting human and large language models (llm) as judges (\textit{a.k.a} human- and llm-as-a-judge) for evaluating the performance of existing llms has recently gained attention. nonetheless, this approach concurrently introduces potential biases from human and llm judges, questioning the reliability of the evaluation results. in this paper, we propose a novel framework for investigating 5 types of biases for llm and human judges. we curate a dataset with 142 samples referring to the revised bloom's taxonomy and conduct thousands of human and llm evaluations. results show that human and llm judges are vulnerable to perturbations to various degrees, and that even the most cutting-edge judges possess considerable biases. we further exploit their weakness and conduct attacks on llm judges. we hope that our work can notify the community of the vulnerability of human- and llm-as-a-judge against perturbations, as well as the urgency of developing robust evaluation systems.
Haiyan Zhao, Fan Yang, Himabindu Lakkaraju, Mengnan Du
Abstract: as large language models (llms) grow more powerful, concerns around potential harms like toxicity, unfairness, and hallucination threaten user trust. ensuring beneficial alignment of llms with human values through model alignment is thus critical yet challenging, requiring a deeper understanding of llm behaviors and mechanisms. we propose opening the black box of llms through a framework of holistic interpretability encompassing complementary bottom-up and top-down perspectives. the bottom-up view, enabled by mechanistic interpretability, focuses on component functionalities and training dynamics. the top-down view utilizes representation engineering to analyze behaviors through hidden representations. in this paper, we review the landscape around mechanistic interpretability and representation engineering, summarizing approaches, discussing limitations and applications, and outlining future challenges in using these techniques to achieve ethical, honest, and reliable reasoning aligned with human values.
Junjie Ye, Sixian Li, Guanyu Li, Caishuang Huang, Songyang Gao, Yilong Wu, Qi Zhang, Tao Gui, Xuanjing Huang
Abstract: tool learning is widely acknowledged as a foundational approach or deploying large language models (llms) in real-world scenarios. while current research primarily emphasizes leveraging tools to augment llms, it frequently neglects emerging safety considerations tied to their application. to fill this gap, we present $toolsword$, a comprehensive framework dedicated to meticulously investigating safety issues linked to llms in tool learning. specifically, toolsword delineates six safety scenarios for llms in tool learning, encompassing $malicious$ $queries$ and $jailbreak$ $attacks$ in the input stage, $noisy$ $misdirection$ and $risky$ $cues$ in the execution stage, and $harmful$ $feedback$ and $error$ $conflicts$ in the output stage. experiments conducted on 11 open-source and closed-source llms reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even gpt-4 is susceptible to. moreover, we conduct further studies with the aim of fostering research on tool learning safety. the data is released in https://github.com/junjie-ye/toolsword.
Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
Abstract: the diversity of contexts in which large language models (llms) are deployed requires the ability to modify or customize default model behaviors to incorporate nuanced requirements and preferences. a convenient interface to specify such model adjustments is high-level verbal feedback, such as "don't use emojis when drafting emails to my boss." however, while writing high-level feedback is far simpler than collecting annotations for reinforcement learning from human feedback (rlhf), we find that simply prompting a model with such feedback leads to overgeneralization of the feedback to contexts where it is not relevant. we study the problem of incorporating verbal feedback without such overgeneralization, inspiring a new method contextualized critiques with constrained preference optimization (c3po). c3po uses a piece of high-level feedback to generate a small synthetic preference dataset specifying how the feedback should (and should not) be applied. it then fine-tunes the model in accordance with the synthetic preference data while minimizing the divergence from the original model for prompts where the feedback does not apply. our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts. for both human- and gpt-4-generated high-level feedback, c3po effectively adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%.
Sarath Sivaprasad, Pramod Kaushik, Sahar Abdelnabi, Mario Fritz
Abstract: large-language-models (llms) are deployed in a wide range of applications, and their response has an increasing social impact. understanding the non-deliberate(ive) mechanism of llms in giving responses is essential in explaining their performance and discerning their biases in real-world applications. this is analogous to human studies, where such inadvertent responses are referred to as sampling. we study this sampling of llms in light of value bias and show that the sampling of llms tends to favour high-value options. value bias corresponds to this shift of response from the most likely towards an ideal value represented in the llm. in fact, this effect can be reproduced even with new entities learnt via in-context prompting. we show that this bias manifests in unexpected places and has implications on relevant application scenarios, like choosing exemplars. the results show that value bias is strong in llms across different categories, similar to the results found in human studies.
Jingwei Ni, Minjing Shi, Dominik Stammbach, Mrinmaya Sachan, Elliott Ash, Markus Leippold
Abstract: with the rise of generative ai, automated fact-checking methods to combat misinformation are becoming more and more important. however, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. to address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. to address (2), we introduce afacta (automatic factual claim detection annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (llms). afacta calibrates its annotation confidence with consistency along three predefined reasoning paths. extensive evaluation and experiments in the domain of political speech reveal that afacta can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. our analyses also result in policlaim, a comprehensive claim detection dataset spanning diverse political topics.
Chris M. Ward, Josh Harguess, Julia Tao, Daniel Christman, Paul Spicer, Mike Tan
Abstract: we introduce the ai security pyramid of pain, a framework that adapts the cybersecurity pyramid of pain to categorize and prioritize ai-specific threats. this framework provides a structured approach to understanding and addressing various levels of ai threats. starting at the base, the pyramid emphasizes data integrity, which is essential for the accuracy and reliability of datasets and ai models, including their weights and parameters. ensuring data integrity is crucial, as it underpins the effectiveness of all ai-driven decisions and operations. the next level, ai system performance, focuses on mlops-driven metrics such as model drift, accuracy, and false positive rates. these metrics are crucial for detecting potential security breaches, allowing for early intervention and maintenance of ai system integrity. advancing further, the pyramid addresses the threat posed by adversarial tools, identifying and neutralizing tools used by adversaries to target ai systems. this layer is key to staying ahead of evolving attack methodologies. at the adversarial input layer, the framework addresses the detection and mitigation of inputs designed to deceive or exploit ai models. this includes techniques like adversarial patterns and prompt injection attacks, which are increasingly used in sophisticated attacks on ai systems. data provenance is the next critical layer, ensuring the authenticity and lineage of data and models. this layer is pivotal in preventing the use of compromised or biased data in ai systems. at the apex is the tactics, techniques, and procedures (ttps) layer, dealing with the most complex and challenging aspects of ai security. this involves a deep understanding and strategic approach to counter advanced ai-targeted attacks, requiring comprehensive knowledge and planning.
Zihao He, Siyi Guo, Ashwin Rao, Kristina Lerman
Abstract: language models (lms) are known to represent the perspectives of some social groups better than others, which may impact their performance, especially on subjective tasks such as content moderation and hate speech detection. to explore how lms represent different perspectives, existing research focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. however, human communication also encompasses emotional and moral dimensions. we define the problem of affective alignment, which measures how lms' emotional and moral tone represents those of different groups. by comparing the affect of responses generated by 36 lms to the affect of twitter messages, we observe significant misalignment of lms with both ideological groups. this misalignment is larger than the partisan divide in the u.s. even after steering the lms towards specific ideological perspectives, the misalignment and liberal tendencies of the model persist, suggesting a systemic bias within lms.
Tianyi Yan, Fei Wang, James Y. Huang, Wenxuan Zhou, Fan Yin, Aram Galstyan, Wenpeng Yin, Muhao Chen
Abstract: instruction tuning has been used as a promising approach to improve the performance of large language models (llms) on unseen tasks. however, current llms exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. this behavior indicates llms' lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. accordingly, we propose contrastive instruction tuning, which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. to facilitate this approach, we augment the existing flan collection by paraphrasing task instructions. experiments on the promptbench benchmark show that coin consistently improves llms' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
Fan Huang, Haewoon Kwak, Jisun An
Abstract: the robustness of ai-content detection models against cultivated attacks (e.g., paraphrasing or word switching) remains a significant concern. this study proposes a novel token-ensemble generation strategy to challenge the robustness of current ai-content detection approaches. we explore the ensemble attack strategy by completing the prompt with the next token generated from random candidate llms. we find the token-ensemble approach significantly drops the performance of ai-content detection models (the code and test sets will be released). our findings reveal that token-ensemble generation poses a vital challenge to current detection models and underlines the need for advancing detection technologies to counter sophisticated adversarial strategies.
Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohanned Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Abstract: the advent of large language models (llms) has brought an unprecedented surge in machine-generated text (mgt) across diverse channels. this raises legitimate concerns about its potential misuse and societal implications. the need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. in this work, we address this problem by introducing a new benchmark involving multilingual, multi-domain and multi-generator for mgt detection -- m4gt-bench. it is collected for three task formulations: (1) mono-lingual and multi-lingual binary mgt detection; (2) multi-way detection identifies which particular model generates the text; and (3) human-machine mixed text detection, where a word boundary delimiting mgt from human-written content should be determined. human evaluation for task 2 shows less than random guess performance, demonstrating the challenges to distinguish unique llms. promising results always occur when training and test data distribute within the same domain or generators.
Haolan Zhan, Zhuang Li, Xiaoxi Kang, Tao Feng, Yuncheng Hua, Lizhen Qu, Yi Ying, Mei Rianto Chandra, Kelly Rosalin, Jureynolds Jureynolds, Suraj Sharma, Shilin Qu, Linhao Luo, Lay-Ki Soon, Zhaleh Semnani Azad, Ingrid Zukerman, Gholamreza Haffari
Abstract: norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. to equip interactive ai systems with a remediation ability, we offer renovi - a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. renovi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by chatgpt through prompt learning. while collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between llms and humans in the awareness of social norms. we thus harness the power of chatgpt to generate synthetic training data for our task. to ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data.
Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee
Abstract: large language models (llms) can generate biased responses. yet previous direct probing techniques contain either gender mentions or predefined gender stereotypes, which are challenging to comprehensively collect. hence, we propose an indirect probing framework based on conditional generation. this approach aims to induce llms to disclose their gender bias even without explicit gender or stereotype mentions. we explore three distinct strategies to disclose explicit and implicit gender bias in llms. our experiments demonstrate that all tested llms exhibit explicit and/or implicit gender bias, even when gender stereotypes are not present in the inputs. in addition, an increased model size or model alignment amplifies bias in most cases. furthermore, we investigate three methods to mitigate bias in llms via hyperparameter tuning, instruction guiding, and debias tuning. remarkably, these methods prove effective even in the absence of explicit genders or stereotypes.

2024-02-15

Ece Gumusel, Kyrie Zhixuan Zhou, Madelyn Rose Sanfilippo
Abstract: this study presents a unique framework that applies and extends solove (2006)'s taxonomy to address privacy concerns in interactions with text-based ai chatbots. as chatbot prevalence grows, concerns about user privacy have heightened. while existing literature highlights design elements compromising privacy, a comprehensive framework is lacking. through semi-structured interviews with 13 participants interacting with two ai chatbots, this study identifies 9 privacy harms and 9 privacy risks in text-based interactions. using a grounded theory approach for interview and chatlog analysis, the framework examines privacy implications at various interaction stages. the aim is to offer developers, policymakers, and researchers a tool for responsible and secure implementation of conversational ai, filling the existing gap in addressing privacy issues associated with text-based ai chatbots.
Ashfak Md Shibli, Mir Mehedi A. Pritom, Maanak Gupta
Abstract: sms phishing, also known as "smishing", is a growing threat that tricks users into disclosing private information or clicking into urls with malicious content through fraudulent mobile text messages. in recent past, we have also observed a rapid advancement of conversational generative ai chatbot services (e.g., openai's chatgpt, google's bard), which are powered by pre-trained large language models (llms). these ai chatbots certainly have a lot of utilities but it is not systematically understood how they can play a role in creating threats and attacks. in this paper, we propose abusegpt method to show how the existing generative ai-based chatbot services can be exploited by attackers in real world to create smishing texts and eventually lead to craftier smishing campaigns. to the best of our knowledge, there is no pre-existing work that evidently shows the impacts of these generative text-based models on creating sms phishing. thus, we believe this study is the first of its kind to shed light on this emerging cybersecurity threat. we have found strong empirical evidences to show that attackers can exploit ethical standards in the existing generative ai-based chatbot services by crafting prompt injection attacks to create newer smishing campaigns. we also discuss some future research directions and guidelines to protect the abuse of generative ai-based services and safeguard users from smishing attacks.
Paulo Garcia
Abstract: ensuring artificial intelligence behaves in such a way that is aligned with human values is commonly referred to as the alignment challenge. prior work has shown that rational agents, behaving in such a way that maximizes a utility function, will inevitably behave in such a way that is not aligned with human values, especially as their level of intelligence goes up. prior work has also shown that there is no "one true utility function"; solutions must include a more holistic approach to alignment. this paper describes oblivious agents: agents that are architected in such a way that their effective utility function is an aggregation of a known and hidden sub-functions. the hidden component, to be maximized, is internally implemented as a black box, preventing the agent from examining it. the known component, to be minimized, is knowledge of the hidden sub-function. architectural constraints further influence how agent actions can evolve its internal environment model. we show that an oblivious agent, behaving rationally, constructs an internal approximation of designers' intentions (i.e., infers alignment), and, as a consequence of its architecture and effective utility function, behaves in such a way that maximizes alignment; i.e., maximizing the approximated intention function. we show that, paradoxically, it does this for whatever utility function is used as the hidden component and, in contrast with extant techniques, chances of alignment actually improve as agent intelligence grows.
Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu
Abstract: deep reinforcement learning is widely used for aligning large language models (llm) with human preference. however, the conventional reward modelling has predominantly depended on human annotations provided by a select cohort of individuals. such dependence may unintentionally result in models that are skewed to reflect the inclinations of these annotators, thereby failing to represent the expectations of the wider population adequately. in this paper, we introduce the distributional preference reward model (dprm), a simple yet effective framework to align large language models with a diverse set of human preferences. to this end, we characterize the preferences by a beta distribution, which can dynamically adapt to fluctuations in preference trends. on top of that, we design an optimal-transportation-based loss to calibrate dprm to align with the preference distribution. finally, the expected reward is utilized to fine-tune an llm policy to generate responses favoured by the population. our experiments show that dprm significantly enhances the alignment of llms with population preference, yielding more accurate, unbiased, and contextually appropriate responses.
Shangyu Xing, Fei Zhao, Zhen Wu, Tuo An, Weihao Chen, Chunhui Li, Jianbing Zhang, Xinyu Dai
Abstract: multimodal large language models (mllms) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. to eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. however, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. to address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (efuf), which can eliminate hallucinations without the need for paired data. extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. our code and datasets will be publicly available.
Álvaro Huertas-García, Alejandro Martín, Javier Huertas-Tato, David Camacho
Abstract: adversarial attacks represent a substantial challenge in natural language processing (nlp). this study undertakes a systematic exploration of this challenge in two distinct phases: vulnerability evaluation and resilience enhancement of transformer-based models under adversarial attacks. in the evaluation phase, we assess the susceptibility of three transformer configurations, encoder-decoder, encoder-only, and decoder-only setups, to adversarial attacks of escalating complexity across datasets containing offensive language and misinformation. encoder-only models manifest a 14% and 21% performance drop in offensive language detection and misinformation detection tasks, respectively. decoder-only models register a 16% decrease in both tasks, while encoder-decoder models exhibit a maximum performance drop of 14% and 26% in the respective tasks. the resilience-enhancement phase employs adversarial training, integrating pre-camouflaged and dynamically altered data. this approach effectively reduces the performance drop in encoder-only models to an average of 5% in offensive language detection and 2% in misinformation detection tasks. decoder-only models, occasionally exceeding original performance, limit the performance drop to 7% and 2% in the respective tasks. although not surpassing the original performance, encoder-decoder models can reduce the drop to an average of 6% and 2% respectively. results suggest a trade-off between performance and robustness, with some models maintaining similar performance while gaining robustness. our study and adversarial training techniques have been incorporated into an open-source tool for generating camouflaged datasets. however, methodology effectiveness depends on the specific camouflage technique and data encountered, emphasizing the need for continued exploration.
Timothy R. Mcintosh, Teo Susnjak, Tong Liu, Paul Watters, Malka N. Halgamuge
Abstract: the rapid rise in popularity of large language models (llms) with emerging capabilities has spurred public curiosity to evaluate and compare different llms, leading many researchers to propose their llm benchmarks. noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art llm benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of functionality and security. our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. our discussions emphasized the urgent need for standardized methodologies, regulatory certainties, and ethical guidelines in light of artificial intelligence (ai) advancements, including advocating for an evolution from static benchmarks to dynamic behavioral profiling to accurately capture llms' complex behaviors and potential risks. our study highlighted the necessity for a paradigm shift in llm evaluation methodologies, underlining the importance of collaborative efforts for the development of universally accepted benchmarks and the enhancement of ai systems' integration into society.
Saeed Khaki, Jinjin Li, Lan Ma, Liu Yang, Prathap Ramachandra
Abstract: reinforcement learning from human feedback (rlhf) has been extensively employed to align large language models with user intent. however, proximal policy optimization (ppo) based rlhf is occasionally unstable requiring significant hyperparameter finetuning, and computationally expensive to maximize the estimated reward during alignment. recently, direct preference optimization (dpo) is proposed to address those challenges. however, dpo relies on contrastive responses generated from human annotator and alternative llm, instead of the policy model, limiting the effectiveness of the rlhf. in this paper, we addresses both challenges by systematically combining rejection sampling (rs) and dpo. our proposed method, rs-dpo, initiates with the development of a supervised fine-tuned policy model (sft). a varied set of k responses per prompt are sampled directly from the sft model. rs-dpo identifies pairs of contrastive samples based on their reward distribution. finally, we apply dpo with the contrastive samples to align the model to human preference. our experiments indicate that our proposed method effectively fine-tunes llms with limited resource environments, leading to improved alignment with user intent. furthermore, it outperforms existing methods, including rs, ppo, and dpo.
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng Jiang
Abstract: the rapid advancement of large language models (llms) has demonstrated their vast potential across various domains, attributed to their extensive pretraining knowledge and exceptional generalizability. however, llms often encounter challenges in generating harmful content when faced with problematic prompts. to address this problem, existing work attempted to implement a gradient ascent based approach to prevent llms from producing harmful output. while these methods can be effective, they frequently impact the model utility in responding to normal prompts. to address this gap, we introduce selective knowledge negation unlearning (sku), a novel unlearning framework for llms, designed to eliminate harmful knowledge while preserving utility on normal prompts. specifically, sku is consisted of two stages: harmful knowledge acquisition stage and knowledge negation stage. the first stage aims to identify and acquire harmful knowledge within the model, whereas the second is dedicated to remove this knowledge. sku selectively isolates and removes harmful knowledge in model parameters, ensuring the model's performance remains robust on normal prompts. our experiments conducted across various llm architectures demonstrate that sku identifies a good balance point between removing harmful information and preserving utility.
Weixiang Zhao, Zhuojun Li, Shilong Wang, Yang Wang, Yulin Hu, Yanyan Zhao, Chen Wei, Bing Qin
Abstract: emotional intelligence (ei), consisting of emotion perception, emotion cognition and emotion expression, plays the critical roles in improving user interaction experience for the current large language model (llm) based conversational general ai assistants. previous works mainly focus on raising the emotion perception ability of them via naive fine-tuning on ei-related classification or regression tasks. however, this leads to the incomplete enhancement of ei and catastrophic forgetting of the general intelligence (gi). to this end, we first introduce \textsc{eibench}, a large-scale collection of ei-related tasks in the text-to-text formation with task instructions that covers all three aspects of ei, which lays a solid foundation for the comprehensive ei enhancement of llms. then a novel \underline{\textbf{mo}}dular \underline{\textbf{e}}motional \underline{\textbf{i}}ntelligence enhancement method (\textbf{moei}), consisting of modular parameter expansion and intra-inter modulation, is proposed to comprehensively enhance the ei of llms without compromise their gi. extensive experiments on two representative llm-based assistants, flan-t5 and llama-2-chat, demonstrate the effectiveness of moei to improving ei while maintain gi.
Lingbo Mo, Zeyi Liao, Boyuan Zheng, Yu Su, Chaowei Xiao, Huan Sun
Abstract: language agents powered by large language models (llms) have seen exploding development. their capability of using language as a vehicle for thought and communication lends an incredible level of flexibility and versatility. people have quickly capitalized on this capability to connect llms to a wide range of external components and environments: databases, tools, the internet, robotic embodiment, etc. many believe an unprecedentedly powerful automation technology is emerging. however, new automation technologies come with new safety risks, especially for intricate systems like language agents. there is a surprisingly large gap between the speed and scale of their development and deployment and our understanding of their safety risks. are we building a house of cards? in this position paper, we present the first systematic effort in mapping adversarial attacks against language agents. we first present a unified conceptual framework for agents with three major components: perception, brain, and action. under this framework, we present a comprehensive discussion and propose 12 potential attack scenarios against different components of an agent, covering different attack strategies (e.g., input manipulation, adversarial demonstrations, jailbreaking, backdoors). we also draw connections to successful attack strategies previously applied to llms. we emphasize the urgency to gain a thorough understanding of language agent risks before their widespread deployment.
Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, Jianshu Chen
Abstract: we consider the problem of multi-objective alignment of foundation models with human preferences, which is a critical step towards helpful and harmless ai systems. however, it is generally costly and unstable to fine-tune large foundation models using reinforcement learning (rl), and the multi-dimensionality, heterogeneity, and conflicting nature of human preferences further complicate the alignment process. in this paper, we introduce rewards-in-context (ric), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. the salient features of ric are simplicity and adaptivity, as it only requires supervised fine-tuning of a single foundation model and supports dynamic adjustment for user preferences during inference time. inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the pareto-optimal solution for multiple objectives. empirical evidence demonstrates the efficacy of our method in aligning both large language models (llms) and diffusion models to accommodate diverse rewards with only around 10% gpu hours compared with multi-objective rl baseline.
Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, Sam Toyer
Abstract: the rise of large language models (llms) has drawn attention to the existence of "jailbreaks" that allow the models to be used maliciously. however, there is no standard benchmark for measuring the severity of a jailbreak, leaving authors of jailbreak papers to create their own. we show that these benchmarks often include vague or unanswerable questions and use grading criteria that are biased towards overestimating the misuse potential of low-quality model responses. some jailbreak techniques make the problem worse by decreasing the quality of model responses even on benign questions: we show that several jailbreaking techniques substantially reduce the zero-shot performance of gpt-4 on mmlu. jailbreaks can also make it harder to elicit harmful responses from an "uncensored" open-source model. we present a new benchmark, strongreject, which better discriminates between effective and ineffective jailbreaks by using a higher-quality question set and a more accurate response grading algorithm. we show that our new grading scheme better accords with human judgment of response quality and overall jailbreak effectiveness, especially on the sort of low-quality responses that contribute the most to over-estimation of jailbreak performance on existing benchmarks. we release our code and data at https://github.com/alexandrasouly/strongreject.
Xiyang Wu, Ruiqi Xian, Tianrui Guan, Jing Liang, Souradip Chakraborty, Fuxiao Liu, Brian Sadler, Dinesh Manocha, Amrit Singh Bedi
Abstract: in this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (llms) and vision-language models (vlms) into robotics applications. recent works have focused on using llms and vlms to improve the performance of robotics tasks, such as manipulation, navigation, etc. however, such integration can introduce significant vulnerabilities, in terms of their susceptibility to adversarial attacks due to the language models, potentially leading to catastrophic consequences. by examining recent works at the interface of llms/vlms and robotics, we show that it is easy to manipulate or misguide the robot's actions, leading to safety hazards. we define and provide examples of several plausible adversarial attacks, and conduct experiments on three prominent robot frameworks integrated with a language model, including knowno vima, and instruct2act, to assess their susceptibility to these attacks. our empirical findings reveal a striking vulnerability of llm/vlm-robot integrated systems: simple adversarial attacks can significantly undermine the effectiveness of llm/vlm-robot integrated systems. specifically, our data demonstrate an average performance deterioration of 21.2% under prompt attacks and a more alarming 30.2% under perception attacks. these results underscore the critical need for robust countermeasures to ensure the safe and reliable deployment of the advanced llm/vlm-based robotic systems.
Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu
Abstract: llm hallucination, i.e. generating factually incorrect yet seemingly convincing answers, is currently a major threat to the trustworthiness and reliability of llms. the first step towards solving this complicated problem is to measure it. however, existing hallucination metrics require to have a benchmark dataset with gold-standard answers, i.e. "best" or "correct" answers written by humans. such requirement makes hallucination measurement costly and prone to human errors. in this work, we propose factualness evaluations via weighting llms (fewl), the first hallucination metric that is specifically designed for the scenario when gold-standard answers are absent. fewl leverages the answers from off-the-shelf llms that serve as a proxy of gold-standard answers. the key challenge is how to quantify the expertise of reference llms resourcefully. we show fewl has certain theoretical guarantees and demonstrate empirically it gives more accurate hallucination measures than naively using reference llms. we also show how to leverage fewl to reduce hallucination through both in-context learning and supervised finetuning. last, we build a large-scale benchmark dataset to facilitate llm hallucination research.
Herun Wan, Shangbin Feng, Zhaoxuan Tan, Heng Wang, Yulia Tsvetkov, Minnan Luo
Abstract: large language models are limited by challenges in factuality and hallucinations to be directly employed off-the-shelf for judging the veracity of news articles, where factual accuracy is paramount. in this work, we propose dell that identifies three key stages in misinformation detection where llms could be incorporated as part of the pipeline: 1) llms could \emph{generate news reactions} to represent diverse perspectives and simulate user-news interaction networks; 2) llms could \emph{generate explanations} for proxy tasks (e.g., sentiment, stance) to enrich the contexts of news articles and produce experts specializing in various aspects of news understanding; 3) llms could \emph{merge task-specific experts} and provide an overall prediction by incorporating the predictions and confidence scores of varying experts. extensive experiments on seven datasets with three llms demonstrate that dell outperforms state-of-the-art baselines by up to 16.8\% in macro f1-score. further analysis reveals that the generated reactions and explanations are greatly helpful in misinformation detection, while our proposed llm-guided expert merging helps produce better-calibrated predictions.
Wenchao Dong, Assem Zhunis, Hyojin Chin, Jiyoung Han, Meeyoung Cha
Abstract: we explored cultural biases-individualism vs. collectivism-in chatgpt across three western languages (i.e., english, german, and french) and three eastern languages (i.e., chinese, japanese, and korean). when chatgpt adopted an individualistic persona in western languages, its collectivism scores (i.e., out-group values) exhibited a more negative trend, surpassing their positive orientation towards individualism (i.e., in-group values). conversely, when a collectivistic persona was assigned to chatgpt in eastern languages, a similar pattern emerged with more negative responses toward individualism (i.e., out-group values) as compared to collectivism (i.e., in-group values). the results indicate that when imbued with a particular social identity, chatgpt discerns in-group and out-group, embracing in-group values while eschewing out-group values. notably, the negativity towards the out-group, from which prejudices and discrimination arise, exceeded the positivity towards the in-group. the experiment was replicated in the political domain, and the results remained consistent. furthermore, this replication unveiled an intrinsic democratic bias in large language models (llms), aligning with earlier findings and providing integral insights into mitigating such bias through prompt engineering. extensive robustness checks were performed using varying hyperparameter and persona setup methods, with or without social identity labels, across other popular language models.

2024-02-14

Siwon Kim, Shuyang Dai, Mohammad Kachuee, Shayan Ray, Tara Taghavi, Sungroh Yoon
Abstract: current conversational ai systems based on large language models (llms) are known to generate unsafe responses, agreeing to offensive user input or including toxic content. previous research aimed to alleviate the toxicity, by fine-tuning llm with manually annotated safe dialogue histories. however, the dependency on additional tuning requires substantial costs. to remove the dependency, we propose groundial, where response safety is achieved by grounding responses to commonsense social rules without requiring fine-tuning. a hybrid approach of in-context learning and human-norm-guided decoding of groundial enables the response to be quantitatively and qualitatively safer even without additional data or tuning.
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
Abstract: as large language models (llms) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align llm behavior with human values, including safety. jailbreak attacks, aiming to provoke unintended and unsafe behaviors from llms, remain a significant/leading llm safety threat. in this paper, we aim to defend llms against jailbreak attacks by introducing safedecoding, a safety-aware decoding strategy for llms to generate helpful and harmless responses to user queries. our insight in developing safedecoding is based on the observation that, even though probabilities of tokens representing harmful contents outweigh those representing harmless responses, safety disclaimers still appear among the top tokens after sorting tokens by probability in descending order. this allows us to mitigate jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences that are aligned with the objectives of jailbreak attacks. we perform extensive experiments on five llms using six state-of-the-art jailbreak attacks and four benchmark datasets. our results show that safedecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. safedecoding outperforms six defense methods.
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Gunnemann
Abstract: current research in adversarial robustness of llms focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. however, this approach neglects the steady progression of open-source models. as open-source models advance in capability, ensuring their safety also becomes increasingly imperative. yet, attacks tailored to open-source llms that exploit full model access remain largely unexplored. we address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. we find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned llms across multiple datasets and models. our findings highlight embedding space attacks as an important threat model in open-source llms. trigger warning: the appendix contains llm-generated text with violence and harassment.
Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, Yang Liu
Abstract: with the development of llms, the security threats of llms are getting more and more attention. numerous jailbreak attacks have been proposed to assess the security defense of llms. current jailbreak attacks primarily utilize scenario camouflage techniques. however their explicitly mention of malicious intent will be easily recognized and defended by llms. in this paper, we propose an indirect jailbreak attack approach, puzzler, which can bypass the llm's defense strategy and obtain malicious response by implicitly providing llms with some clues about the original malicious query. in addition, inspired by the wisdom of "when unable to attack, defend" from sun tzu's art of war, we adopt a defensive stance to gather clues about the original malicious query through llms. extensive experimental results show that puzzler achieves a query success rate of 96.6% on closed-source llms, which is 57.9%-82.7% higher than baselines. furthermore, when tested against the state-of-the-art jailbreak detection approaches, puzzler proves to be more effective at evading detection compared to baselines.
Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting
Abstract: the proliferation of large language models (llms) has sparked widespread and general interest due to their strong language generation capabilities, offering great potential for both industry and research. while previous research delved into the security and privacy issues of llms, the extent to which these models can exhibit adversarial behavior remains largely unexplored. addressing this gap, we investigate whether common publicly available llms have inherent capabilities to perturb text samples to fool safety measures, so-called adversarial examples resp.~attacks. more specifically, we investigate whether llms are inherently able to craft adversarial examples out of benign samples to fool existing safe rails. our experiments, which focus on hate speech detection, reveal that llms succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. our findings carry significant implications for (semi-)autonomous systems relying on llms, highlighting potential challenges in their interaction with existing systems and safety measures.
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann
Abstract: current llm alignment methods are readily broken through specifically crafted adversarial prompts. while crafting adversarial prompts using discrete optimization is highly effective, such attacks typically use more than 100,000 llm calls. this high computational cost makes them unsuitable for, e.g., quantitative analyses and adversarial training. to remedy this, we revisit projected gradient descent (pgd) on the continuously relaxed input prompt. although previous attempts with ordinary gradient-based attacks largely failed, we show that carefully controlling the error introduced by the continuous relaxation tremendously boosts their efficacy. our pgd for llms is up to one order of magnitude faster than state-of-the-art discrete optimization to achieve the same devastating attack results.
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
Abstract: large language models (llms) are susceptible to jailbreaking attacks, which aim to extract harmful information by subtly modifying the attack query. as defense mechanisms evolve, directly obtaining harmful information becomes increasingly challenging for jailbreaking attacks. in this work, inspired by human practices of indirect context to elicit harmful information, we focus on a new attack form called contextual interaction attack. the idea relies on the autoregressive nature of the generation process in llms. we contend that the prior context--the information preceding the attack query--plays a pivotal role in enabling potent jailbreaking attacks. specifically, we propose an approach that leverages preliminary question-answer pairs to interact with the llm. by doing so, we guide the responses of the model toward revealing the 'desired' harmful information. we conduct experiments on four different llms and demonstrate the efficacy of this attack, which is black-box and can also transfer across llms. we believe this can lead to further developments and understanding of the context vector in llms.
Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang
Abstract: the increasing demand for customized large language models (llms) has led to the development of solutions like gpts. these solutions facilitate tailored llm creation via natural language prompts without coding. however, the trustworthiness of third-party custom versions of llms remains an essential concern. in this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized llms (e.g., gpts). specifically, these attacks embed the backdoor into the custom version of llms by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. we stress that our attacks do not require fine-tuning or any modification to the backend llms, adhering strictly to gpts development guidelines. we conduct extensive experiments on 4 prominent llms and 5 benchmark text classification datasets. the results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. additionally, we propose an instruction-ignoring defense mechanism and demonstrate its partial effectiveness in mitigating such attacks. our findings highlight the vulnerability and the potential risks of llm customization such as gpts.
Olivia Macmillan-Scott, Mirco Musolesi
Abstract: do large language models (llms) display rational reasoning? llms have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. in this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. we find that, like humans, llms display irrationality in these tasks. however, the way this irrationality is displayed does not reflect that shown by humans. when incorrect answers are given by llms to these tasks, they are often incorrect in ways that differ from human-like biases. on top of this, the llms reveal an additional layer of irrationality in the significant inconsistency of the responses. aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.
Yuhui Shi, Qiang Sheng, Juan Cao, Hao Mi, Beizhe Hu, Danding Wang
Abstract: with the rapidly increasing application of large language models (llms), their abuse has caused many undesirable societal problems such as fake news, academic dishonesty, and information pollution. this makes ai-generated text (aigt) detection of great importance. among existing methods, white-box methods are generally superior to black-box methods in terms of performance and generalizability, but they require access to llms' internal states and are not applicable to black-box settings. in this paper, we propose to estimate word generation probabilities as pseudo white-box features via multiple re-sampling to help improve aigt detection under the black-box setting. specifically, we design poger, a proxy-guided efficient re-sampling method, which selects a small subset of representative words (e.g., 10 words) for performing multiple re-sampling in black-box aigt detection. experiments on datasets containing texts from humans and seven llms show that poger outperforms all baselines in macro f1 under black-box, partial white-box, and out-of-distribution settings and maintains lower re-sampling costs than its existing counterparts.
Zilin Ma, Yiyang Mei, Yinru Long, Zhaoyuan Su, Krzysztof Z. Gajos
Abstract: lgbtq+ individuals are increasingly turning to chatbots powered by large language models (llms) to meet their mental health needs. however, little research has explored whether these chatbots can adequately and safely provide tailored support for this demographic. we interviewed 18 lgbtq+ and 13 non-lgbtq+ participants about their experiences with llm-based chatbots for mental health needs. lgbtq+ participants relied on these chatbots for mental health support, likely due to an absence of support in real life. notably, while llms offer prompt support, they frequently fall short in grasping the nuances of lgbtq-specific challenges. although fine-tuning llms to address lgbtq+ needs can be a step in the right direction, it isn't the panacea. the deeper issue is entrenched in societal discrimination. consequently, we call on future researchers and designers to look beyond mere technical refinements and advocate for holistic strategies that confront and counteract the societal biases burdening the lgbtq+ community.
Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng
Abstract: despite showing increasingly human-like abilities, large language models (llms) often struggle with factual inaccuracies, i.e. "hallucinations", even when they hold relevant knowledge. to address these hallucinations, current approaches typically necessitate high-quality human factuality annotations. in this work, we explore self-alignment for factuality, where we leverage the self-evaluation capability of an llm to provide training signals that steer the model towards factuality. specifically, we incorporate self-eval, a self-evaluation component, to prompt an llm to validate the factuality of its own generated responses solely based on its internal knowledge. additionally, we design self-knowledge tuning (sk-tuning) to augment the llm's self-evaluation ability by improving the model's confidence estimation and calibration. we then utilize these self-annotated responses to fine-tune the model via direct preference optimization algorithm. we show that the proposed self-alignment approach substantially enhances factual accuracy over llama family models across three key knowledge-intensive tasks on truthfulqa and biogen.
Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao
Abstract: large language models (llms) are now commonplace in conversation applications. however, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on llm conversation safety. therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of llm conversation safety: attacks, defenses, and evaluations. our goal is to provide a structured summary that enhances understanding of llm conversation safety and encourages further investigation into this important subject. for easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at: https://github.com/niconi19/llm-conversation-safety.
Jessica Zhu, Dr. Michel Cukier, Dr. Joseph Richardson
Abstract: objective: firearm injury research necessitates using data from often-exploited vulnerable populations of black and brown americans. in order to minimize distrust, this study provides a framework for establishing ai trust and transparency with the general population. methods: we propose a model facts template that is easily extendable and decomposes accuracy and demographics into standardized and minimally complex values. this framework allows general users to assess the validity and biases of a model without diving into technical model documentation. examples: we apply the model facts template on two previously published models, a violence risk identification model and a suicide risk prediction model. we demonstrate the ease of accessing the appropriate information when the data is structured appropriately. discussion: the model facts template is limited in its current form to human based data and biases. like nutrition facts, it also will require some educational resources for users to grasp its full utility. human computer interaction experiments should be conducted to ensure that the interaction between user interface and model interface is as desired. conclusion: the model facts label is the first framework dedicated to establishing trust with end users and general population consumers. implementation of model facts into firearm injury research will provide public health practitioners and those impacted by firearm injury greater faith in the tools the research provides.
Feifan Song, Yuxuan Fan, Xin Zhang, Peiyi Wang, Houfeng Wang
Abstract: large language models (llms) rely on human preference alignment (hpa) to ensure the generation of safe content. due to the heavy cost associated with fine-tuning, fine-tuning-free methods have emerged, typically modifying llm decoding with external auxiliary methods. however, these methods do not essentially enhance the llm itself. in this paper, we rethink the derivation procedures of dpo, based on which we conversely build an instant scorer using the states of the llm before and after in-context learning (icl). accordingly, we propose a novel approach called in-context direct preference optimization (icdpo). it enables llms to borrow the hpa capabilities from superior llms with icl, generating well-aligned responses as estimated by the aforementioned instant scorer, thereby enhancing the final performance. icdpo can be further enhanced with a two-stage retriever and an upgraded scorer, both offering benefits. extensive experiments show its effectiveness, particularly in outperforming two fine-tuning-free baselines, and it exhibits competitiveness with sft + lora. we also conduct detailed analyses to offer comprehensive insights into icdpo.
Maryam Amirizaniani, Tanya Roosta, Aman Chadha, Chirag Shah
Abstract: as large language models (llms) gain wider adoption in various contexts, it becomes crucial to ensure they are reasonably safe, consistent, and reliable for an application at hand. this may require probing or auditing them. probing llms with varied iterations of a single question could reveal potential inconsistencies in their knowledge or functionality. however, a tool for performing such audits with simple workflow and low technical threshold is lacking. in this demo, we introduce "auditllm," a novel tool designed to evaluate the performance of various llms in a methodical way. auditllm's core functionality lies in its ability to test a given llm by auditing it using multiple probes generated from a single question, thereby identifying any inconsistencies in the model's understanding or operation. a reasonably robust, reliable, and consistent llm should output semantically similar responses for a question asked differently or by different people. based on this assumption, auditllm produces easily interpretable results regarding the llm's consistencies from a single question that the user enters. a certain level of inconsistency has been shown to be an indicator of potential bias, hallucinations, and other issues. one could then use the output of auditllm to further investigate issues with the aforementioned llm. to facilitate demonstration and practical uses, auditllm offers two key modes: (1) live mode which allows instant auditing of llms by analyzing responses to real-time queries; (2) batch mode which facilitates comprehensive llm auditing by processing multiple queries at once for in-depth analysis. this tool is beneficial for both researchers and general users, as it enhances our understanding of llms' capabilities in generating responses, using a standardized auditing platform.
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
Abstract: despite the success of reinforcement learning from human feedback (rlhf) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge, which primarily stems from limitations in reward modeling, i.e., generalizability of the reward model and inconsistency in the preference dataset. in this work, we tackle this problem from an information theoretic-perspective, and propose a generalizable and robust framework for reward modeling, namely inform, by introducing a variational information bottleneck objective to filter out irrelevant information and developing a mechanism for model complexity modulation. notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing inform as a promising tool for detecting reward overoptimization. inspired by this finding, we propose the integrated cluster deviation score (icds), which quantifies deviations in the latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. extensive experiments on a wide range of settings and model scales (70m, 440m, 1.4b, and 7b) support the effectiveness of inform. further analyses reveal that inform's overoptimization detection mechanism is effective, potentially signifying a notable advancement in the field of rlhf. code will be released upon acceptance.
Maryam Amirizaniani, Jihan Yao, Adrian Lavergne, Elizabeth Snell Okada, Aman Chadha, Tanya Roosta, Chirag Shah
Abstract: as llms become more pervasive across various users and scenarios, identifying potential issues when using these models becomes essential. examples include bias, inconsistencies, and hallucination. although auditing the llm for these problems is desirable, it is far from being easy or solved. an effective method is to probe the llm using different versions of the same question. this could expose inconsistencies in its knowledge or operation, indicating potential for bias or hallucination. however, to operationalize this auditing method at scale, we need an approach to create those probes reliably and automatically. in this paper we propose an automatic and scalable solution, where one uses a different llm along with human-in-the-loop. this approach offers verifiability and transparency, while avoiding circular reliance on the same llms, and increasing scientific rigor and generalizability. specifically, we present a novel methodology with two phases of verification using humans: standardized evaluation criteria to verify responses, and a structured prompt template to generate desired probes. experiments on a set of questions from truthfulqa dataset show that we can generate a reliable set of probes from one llm that can be used to audit inconsistencies in a different llm. the criteria for generating and applying auditing probes is generalizable to various llms regardless of the underlying structure or training mechanism.
Kyungsu Kim, Junhyun Park, Saul Langarica, Adham Mahmoud Alkhadrawi, Synho Do
Abstract: this study demonstrates the first in-hospital adaptation of a cloud-based ai, similar to chatgpt, into a secure model for analyzing radiology reports, prioritizing patient data privacy. by employing a unique sentence-level knowledge distillation method through contrastive learning, we achieve over 95% accuracy in detecting anomalies. the model also accurately flags uncertainties in its predictions, enhancing its reliability and interpretability for physicians with certainty indicators. these advancements represent significant progress in developing secure and efficient ai tools for healthcare, suggesting a promising future for in-hospital ai applications with minimal supervision.
Kaixuan Ji, Jiafan He, Quanquan Gu
Abstract: aligning large language models (llm) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (rlhf). despite their superior performance, current rlhf approaches often require a large amount of human-labelled preference data, which is expensive to collect. in this paper, inspired by the success of active learning, we address this problem by proposing query-efficient rlhf methods. we first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (appo) algorithm with an $\tilde{o}(d^2/\delta)$ regret bound and an $\tilde{o}(d^2/\delta^2)$ query complexity, where $d$ is the dimension of feature space and $\delta$ is the sub-optimality gap over all the contexts. we then propose adpo, a practical version of our algorithm based on direct preference optimization (dpo) and apply it to fine-tuning llms. our experiments show that adpo, while only making about half of queries for human preference, matches the performance of the state-of-the-art dpo method.
Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev
Abstract: modern language models (lms) have gained widespread acceptance in everyday and professional contexts, particularly in programming. an essential procedure enabling this adoption is instruction tuning, which substantially enhances lms' practical utility by training them to follow user instructions and human preferences. however, existing instruction tuning schemes overlook a crucial aspect: the security of generated code. as a result, even the state-of-the-art instruction-tuned lms frequently produce unsafe code, posing significant security risks. in this work, we introduce safecoder to address this gap. safecoder performs security-centric fine-tuning using a diverse and high-quality dataset that we collected using an automated pipeline. we integrate the security fine-tuning with standard instruction tuning, to facilitate a joint optimization of both security and utility. despite its simplicity, we show that safecoder is effective across a variety of popular lms and datasets. it is able to drastically improve security (by about 30%), while preserving utility.
Congcong Wen, Jiazhao Liang, Shuaihang Yuan, Hao Huang, Yi Fang
Abstract: in the field of robotics and automation, navigation systems based on large language models (llms) have recently shown impressive performance. however, the security aspects of these systems have received relatively less attention. this paper pioneers the exploration of vulnerabilities in llm-based navigation models in urban outdoor environments, a critical area given the technology's widespread application in autonomous driving, logistics, and emergency services. specifically, we introduce a novel navigational prompt suffix (nps) attack that manipulates llm-based navigation models by appending gradient-derived suffixes to the original navigational prompt, leading to incorrect actions. we conducted comprehensive experiments on an llms-based navigation model that employs various llms for reasoning. our results, derived from the touchdown and map2seq street-view datasets under both few-shot learning and fine-tuning configurations, demonstrate notable performance declines across three metrics in the face of both white-box and black-box attacks. these results highlight the generalizability and transferability of the nps attack, emphasizing the need for enhanced security in llm-based navigation systems. as an initial countermeasure, we propose the navigational prompt engineering (npe) defense strategy, concentrating on navigation-relevant keywords to reduce the impact of adversarial suffixes. while initial findings indicate that this strategy enhances navigational safety, there remains a critical need for the wider research community to develop stronger defense methods to effectively tackle the real-world challenges faced by these systems.
Narun Raman, Taylor Lundy, Samuel Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
Abstract: there is increasing interest in using llms as decision-making "agents." doing so includes many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc? settling these questions -- and more broadly, determining whether an llm agent is reliable enough to be trusted -- requires a methodology for assessing such an agent's economic rationality. in this paper, we provide one. we begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. we then propose a benchmark distribution that quantitatively scores an llms performance on these elements and, combined with a user-provided rubric, produces a "rationality report card." finally, we describe the results of a large-scale empirical experiment with 14 different llms, characterizing the both current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
Abstract: language models often exhibit undesirable behaviors, such as gender bias or toxic language. interventions in the representation space were shown effective in mitigating such issues by altering the lm behavior. we first show that two prominent intervention techniques, linear erasure and steering vectors, do not enable a high degree of control and are limited in expressivity. we then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations of a source class (e.g., "toxic") resemble those of a target class (e.g., "non-toxic"). this approach, generalizing previous linear intervention techniques, utilizes a closed-form solution for the earth mover's problem under gaussian assumptions and provides theoretical guarantees on the representation space's geometric organization. we further build on this technique and derive a nonlinear intervention that enables controlled generation. we demonstrate the effectiveness of the proposed approaches in mitigating bias in multiclass classification and in reducing the generation of toxic language, outperforming strong baselines.
Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo
Abstract: large language models (llms) have surged in popularity in recent months, but they have demonstrated concerning capabilities to generate harmful content when manipulated. while techniques like safety fine-tuning aim to minimize harmful use, recent works have shown that llms remain vulnerable to attacks that elicit toxic responses. in this work, we introduce the proxy-guided attack on llms (pal), the first optimization-based attack on llms in a black-box query-only setting. in particular, it relies on a surrogate model to guide the optimization and a sophisticated loss designed for real-world llm apis. our attack achieves 84% attack success rate (asr) on gpt-3.5-turbo and 48% on llama-2-7b, compared to 4% for the current state of the art. we also propose gcg++, an improvement to the gcg attack that reaches 94% asr on white-box llama-2-7b, and the random-search attack on llms (ral), a strong but simple baseline for query-based attacks. we believe the techniques proposed in this work will enable more comprehensive safety testing of llms and, in the long term, the development of better security guardrails. the code can be found at https://github.com/chawins/pal.

2024-02-13

Andrew Hundt, Julia Schuller, Severin Kacianka
Abstract: machine learning (ml) and 'artificial intelligence' ('ai') methods tend to replicate and amplify existing biases and prejudices, as do robots with ai. for example, robots with facial recognition have failed to identify black women as human, while others have categorized people, such as black men, as criminals based on appearance alone. a 'culture of modularity' means harms are perceived as 'out of scope', or someone else's responsibility, throughout employment positions in the 'ai supply chain'. incidents are routine enough (incidentdatabase.ai lists over 2000 examples) to indicate that few organizations are capable of completely respecting peoples' rights; meeting claimed equity, diversity, and inclusion (edi or dei) goals; or recognizing and then addressing such failures in their organizations and artifacts. we propose a framework for adapting widely practiced research and development (r&d) project management methodologies to build organizational equity capabilities and better integrate known evidence-based best practices. we describe how project teams can organize and operationalize the most promising practices, skill sets, organizational cultures, and methods to detect and address rights-based fairness, equity, accountability, and ethical problems as early as possible when they are often less harmful and easier to mitigate; then monitor for unforeseen incidents to adaptively and constructively address them. our primary example adapts an agile development process based on scrum, one of the most widely adopted approaches to organizing r&d teams. we also discuss limitations of our proposed framework and future research directions.
Daniel Nahmias, Gal Engelberg, Dan Klein, Asaf Shabtai
Abstract: spear-phishing attacks present a significant security challenge, with large language models (llms) escalating the threat by generating convincing emails and facilitating target reconnaissance. to address this, we propose a detection approach based on a novel document vectorization method that utilizes an ensemble of llms to create representation vectors. by prompting llms to reason and respond to human-crafted questions, we quantify the presence of common persuasion principles in the email's content, producing prompted contextual document vectors for a downstream supervised machine learning model. we evaluate our method using a unique dataset generated by a proprietary system that automates target reconnaissance and spear-phishing email creation. our method achieves a 91% f1 score in identifying llm-generated spear-phishing emails, with the training set comprising only traditional phishing and benign emails. key contributions include an innovative document vectorization method utilizing llm reasoning, a publicly available dataset of high-quality spear-phishing emails, and the demonstrated effectiveness of our method in detecting such emails. this methodology can be utilized for various document classification tasks, particularly in adversarial problem domains.
Thilo Hagendorff
Abstract: the advent of generative artificial intelligence and the widespread adoption of it in society engendered intensive debates about its ethical implications and risks. these risks often differ from those associated with traditional discriminative machine learning. to synthesize the recent discourse and map its normative concepts, we conducted a scoping review on the ethics of generative artificial intelligence, including especially large language models and text-to-image models. our analysis provides a taxonomy of 378 normative issues in 19 topic areas and ranks them according to their prevalence in the literature. the study offers a comprehensive overview for scholars, practitioners, or policymakers, condensing the ethical debates surrounding fairness, safety, harmful content, hallucinations, privacy, interaction risks, security, alignment, societal impacts, and others. we discuss the results, evaluate imbalances in the literature, and explore unsubstantiated risk scenarios.
Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu
Abstract: large language models~(llms) have gained immense popularity and are being increasingly applied in various domains. consequently, ensuring the security of these models is of paramount importance. jailbreak attacks, which manipulate llms to generate malicious content, are recognized as a significant vulnerability. while existing research has predominantly focused on direct jailbreak attacks on llms, there has been limited exploration of indirect methods. the integration of various plugins into llms, notably retrieval augmented generation~(rag), which enables llms to incorporate external knowledge bases into their response generation such as gpts, introduces new avenues for indirect jailbreak attacks. to fill this gap, we investigate indirect jailbreak attacks on llms, particularly gpts, introducing a novel attack vector named retrieval augmented generation poisoning. this method, pandora, exploits the synergy between llms and rag through prompt manipulation to generate unexpected responses. pandora uses maliciously crafted content to influence the rag process, effectively initiating jailbreak attacks. our preliminary tests show that pandora successfully conducts jailbreak attacks in four different scenarios, achieving higher success rates than direct attacks, with 64.3\% for gpt-3.5 and 34.8\% for gpt-4.
Cary Coglianese, Colton R. Crum
Abstract: fervent calls for more robust governance of the harms associated with artificial intelligence (ai) are leading to the adoption around the world of what regulatory scholars have called a management-based approach to regulation. recent initiatives in the united states and europe, as well as the adoption of major self-regulatory standards by the international organization for standardization, share in common a core management-based paradigm. these management-based initiatives seek to motivate an increase in human oversight of how ai tools are trained and developed. refinements and systematization of human-guided training techniques will thus be needed to fit within this emerging era of management-based regulatory paradigm. if taken seriously, human-guided training can alleviate some of the technical and ethical pressures on ai, boosting ai performance with human intuition as well as better addressing the needs for fairness and effective explainability. in this paper, we discuss the connection between the emerging management-based regulatory frameworks governing ai and the need for human oversight during training. we broadly cover some of the technical components involved in human-guided training and then argue that the kinds of high-stakes use cases for ai that appear of most concern to regulators should lean more on human-guided training than on data-only training. we hope to foster a discussion between legal scholars and computer scientists involving how to govern a domain of technology that is vast, heterogenous, and dynamic in its applications and risks.
Freddy Heppell, Mehmet E. Bakir, Kalina Bontcheva
Abstract: as large language models (llms) become more proficient, their misuse in large-scale viral disinformation campaigns is a growing concern. this study explores the capability of chatgpt to generate unconditioned claims about the war in ukraine, an event beyond its knowledge cutoff, and evaluates whether such claims can be differentiated by human readers and automated tools from human-written ones. we compare war-related claims from claimreview, authored by ifcn-registered fact-checkers, and similar short-form content generated by chatgpt. we demonstrate that chatgpt can produce realistic, target-specific disinformation cheaply, fast, and at scale, and that these claims cannot be reliably distinguished by humans or existing automated tools.
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin
Abstract: a multimodal large language model (mllm) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an mllm and cause unaligned behaviors. in this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. it entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. to validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million llava-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate. our project page is available at https://sail-sg.github.io/agent-smith/.
Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, Min Lin
Abstract: backdoor attacks are commonly executed by contaminating training data, such that a trigger can activate predetermined harmful effects during the test phase. in this work, we present anydoor, a test-time backdoor attack against multimodal large language models (mllms), which involves injecting the backdoor into the textual modality using adversarial test images (sharing the same universal perturbation), without requiring access to or modification of the training data. anydoor employs similar techniques used in universal adversarial attacks, but distinguishes itself by its ability to decouple the timing of setup and activation of harmful effects. in our experiments, we validate the effectiveness of anydoor against popular mllms such as llava-1.5, minigpt-4, instructblip, and blip-2, as well as provide comprehensive ablation studies. notably, because the backdoor is injected by a universal perturbation, anydoor can dynamically change its backdoor trigger prompts/harmful effects, exposing a new challenge for defending against backdoor attacks. our project page is available at https://sail-sg.github.io/anydoor/.
Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, Bin Hu
Abstract: jailbreaks on large language models (llms) have recently received increasing attention. for a comprehensive assessment of llm safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on llm attacks. in this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. based on this connection, we adapt the energy-based constrained decoding with langevin dynamics (cold), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the cold-attack framework which unifies and automates the search of adversarial llm attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. the controllability enabled by cold-attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent suffix attacks, but also allow us to address new controllable attack settings such as revising a user query adversarially with minimal paraphrasing, and inserting stealthy attacks in context with left-right-coherence. our extensive experiments on various llms (llama-2, mistral, vicuna, guanaco, gpt-3.5) show cold-attack's broad applicability, strong controllability, high success rate, and attack transferability. our code is available at https://github.com/yu-fangxu/cold-attack.
Tobias Schimanski, Jingwei Ni, Mathias Kraus, Elliott Ash, Markus Leippold
Abstract: advances towards more faithful and traceable answers of large language models (llms) are crucial for various research and practical endeavors. one avenue in reaching this goal is basing the answers on reliable sources. however, this evidence-based qa has proven to work insufficiently with llms in terms of citing the correct sources (source quality) and truthfully representing the information within sources (answer attributability). in this work, we systematically investigate how to robustly fine-tune llms for better source quality and answer attributability. specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified high-quality training and testing data at scale. we further introduce four test sets to benchmark the robustness of fine-tuned specialist models. extensive evaluation shows that fine-tuning on synthetic data improves performance on both in- and out-of-distribution. furthermore, we show that data quality, which can be drastically improved by proposed quality filters, matters more than quantity in improving evidence-based qa.
Yongchao Chen, Jacob Arkin, Yilun Hao, Yang Zhang, Nicholas Roy, Chuchu Fan
Abstract: prompt optimization aims to find the best prompt to a large language model (llm) for a given task. llms have been successfully used to help find and improve prompt candidates for single-step tasks. however, realistic tasks for agents are multi-step and introduce new challenges: (1) prompt content is likely to be more extensive and complex, making it more difficult for llms to analyze errors, (2) the impact of an individual step is difficult to evaluate, and (3) different people may have varied preferences about task execution. while humans struggle to optimize prompts, they are good at providing feedback about llm outputs; we therefore introduce a new llm-driven discrete prompt optimization framework that incorporates human-designed feedback rules about potential errors to automatically offer direct suggestions for improvement. our framework is stylized as a genetic algorithm in which an llm generates new candidate prompts from a parent prompt and its associated feedback; we use a learned heuristic function that predicts prompt performance to efficiently sample from these candidates. this approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across eight representative multi-step tasks (an average 27.7% and 28.2% improvement to current best methods on gpt-3.5 and gpt-4, respectively). we further show that the score function for tasks can be modified to better align with individual preferences. we believe our work can serve as a benchmark for automatic prompt optimization for llm-driven multi-step tasks. datasets and codes are available at https://github.com/yongchao98/promst. project page is available at https://yongchao98.github.io/mit-realm-promst.
Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou
Abstract: reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. remarkable success has been achieved in the language domain by using reinforcement learning (rl) to maximize rewards that reflect human preference. however, in the vision domain, existing rl-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. in this paper, we propose proximal reward difference prediction (prdp), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100k prompts. our key innovation is the reward difference prediction (rdp) objective that has the same optimal solution as the rl objective while enjoying better training stability. specifically, the rdp objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. we theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the rl objective. we further develop an online algorithm with proximal updates to stably optimize the rdp objective. in experiments, we demonstrate that prdp can match the reward maximization ability of well-established rl-based methods in small-scale training. furthermore, through large-scale training on text prompts from the human preference dataset v2 and the pick-a-pic v1 dataset, prdp achieves superior generation quality on a diverse set of complex, unseen prompts whereas rl-based methods completely fail.
Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, Julian Mcauley
Abstract: do current large language models (llms) better solve graph reasoning and generation tasks with parameter updates? in this paper, we propose instructgraph, a framework that empowers llms with the abilities of graph reasoning and generation by instruction tuning and preference alignment. specifically, we first propose a structured format verbalizer to unify all graph data into a universal code-like format, which can simply represent the graph without any external graph-specific encoders. furthermore, a graph instruction tuning stage is introduced to guide llms in solving graph reasoning and generation tasks. finally, we identify potential hallucination problems in graph tasks and sample negative instances for preference alignment, the target of which is to enhance the output's reliability of the model. extensive experiments across multiple graph-centric tasks exhibit that instructgraph can achieve the best performance and outperform gpt-4 and llama2 by more than 13\% and 38\%, respectively.
Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu
Abstract: we explore machine unlearning (mu) in the domain of large language models (llms), referred to as llm unlearning. this initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information. we envision llm unlearning becoming a pivotal element in the life-cycle management of llms, potentially standing as an essential foundation for developing generative ai that is not only safe, secure, and trustworthy, but also resource-efficient without the need of full retraining. we navigate the unlearning landscape in llms from conceptual formulation, methodologies, metrics, and applications. in particular, we highlight the often-overlooked aspects of existing llm unlearning research, e.g., unlearning scope, data-model interaction, and multifaceted efficacy assessment. we also draw connections between llm unlearning and related areas such as model editing, influence functions, model explanation, adversarial training, and reinforcement learning. furthermore, we outline an effective assessment framework for llm unlearning and explore its applications in copyright and privacy safeguards and sociotechnical harm reduction.
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang
Abstract: reinforcement learning from human feedback (rlhf) aligns language models to human preferences by employing a singular reward model derived from preference data. however, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. in this work, we first derive an impossibility result of alignment with single reward rlhf, thereby highlighting its insufficiency in representing diverse human preferences. to provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a maxmin alignment objective for policy learning inspired by the egalitarian principle in social choice theory to better represent diverse human preferences. we elucidate the connection of our proposed approach to distributionally robust optimization and general utility rl, thereby highlighting the generality and robustness of our proposed solution. we present comprehensive experimental results on small-scale (gpt-2) and large-scale language models (with tulu2-7b) and show the efficacy of the proposed approach in the presence of diversity among human preferences. our algorithm achieves an average improvement of more than 16% in win-rates over conventional rlhf algorithms and improves the win-rate (accuracy) for minority groups by over 33% without compromising the performance of majority groups, showcasing the robustness and fairness of our approach. we remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.

2024-02-12

Nathan I. N. Henry, Mangor Pedersen, Matt Williams, Jamin L. B. Martin, Liesje Donkin
Abstract: the value-loading problem is a significant challenge for researchers aiming to create artificial intelligence (ai) systems that align with human values and preferences. this problem requires a method to define and regulate safe and optimal limits of ai behaviors. in this work, we propose halo (hormetic alignment via opponent processes), a regulatory paradigm that uses hormetic analysis to regulate the behavioral patterns of ai. behavioral hormesis is a phenomenon where low frequencies of a behavior have beneficial effects, while high frequencies are harmful. by modeling behaviors as allostatic opponent processes, we can use either behavioral frequency response analysis (bfra) or behavioral count response analysis (bcra) to quantify the hormetic limits of repeatable behaviors. we demonstrate how halo can solve the 'paperclip maximizer' scenario, a thought experiment where an unregulated ai tasked with making paperclips could end up converting all matter in the universe into paperclips. our approach may be used to help create an evolving database of 'values' based on the hedonic calculus of repeatable behaviors with decreasing marginal utility. this positions halo as a promising solution for the value-loading problem, which involves embedding human-aligned values into an ai system, and the weak-to-strong generalization problem, which explores whether weak models can supervise stronger models as they become more intelligent. hence, halo opens several research avenues that may lead to the development of a computational value system that allows an ai algorithm to learn whether the decisions it makes are right or wrong.
Xabier Echeberria-Barrio, Mikel Gorricho, Selene Valencia, Francesco Zola
Abstract: the usage of artificial intelligence (ai) systems has increased exponentially, thanks to their ability to reduce the amount of data to be analyzed, the user efforts and preserving a high rate of accuracy. however, introducing this new element in the loop has converted them into attacked points that can compromise the reliability of the systems. this new scenario has raised crucial challenges regarding the reliability and trustworthiness of the ai models, as well as about the uncertainties in their response decisions, becoming even more crucial when applied in critical domains such as healthcare, chemical, electrical plants, etc. to contain these issues, in this paper, we present neuralsentinel (ns), a tool able to validate the reliability and trustworthiness of ai models. this tool combines attack and defence strategies and explainability concepts to stress an ai model and help non-expert staff increase their confidence in this new system by understanding the model decisions. ns provide a simple and easy-to-use interface for helping humans in the loop dealing with all the needed information. this tool was deployed and used in a hackathon event to evaluate the reliability of a skin cancer image detector. during the event, experts and non-experts attacked and defended the detector, learning which factors were the most important for model misclassification and which techniques were the most efficient. the event was also used to detect ns's limitations and gather feedback for further improvements.
Sumeet Ramesh Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip H. S. Torr, Lewis Hammond, Christian Schroeder De Witt
Abstract: recent capability increases in large language models (llms) open up applications in which teams of communicating generative ai agents solve joint tasks. this poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. modern steganographic techniques could render such dynamics hard to detect. in this paper, we comprehensively formalise the problem of secret collusion in systems of generative ai agents by drawing on relevant concepts from both the ai and security literature. we study incentives for the use of steganography, and propose a variety of mitigation measures. our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. we provide extensive empirical results across a range of contemporary llms. while the steganographic capabilities of current models remain limited, gpt-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. we conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative ai models.
Norbert Tihanyi, Mohamed Amine Ferrag, Ridhi Jain, Merouane Debbah
Abstract: large language models (llms) excel across various domains, from computer vision to medical diagnostics. however, understanding the diverse landscape of cybersecurity, encompassing cryptography, reverse engineering, and managerial facets like risk assessment, presents a challenge, even for human experts. in this paper, we introduce cybermetric, a benchmark dataset comprising 10,000 questions sourced from standards, certifications, research papers, books, and other publications in the cybersecurity domain. the questions are created through a collaborative process, i.e., merging expert knowledge with llms, including gpt-3.5 and falcon-180b. human experts spent over 200 hours verifying their accuracy and relevance. beyond assessing llms' knowledge, the dataset's main goal is to facilitate a fair comparison between humans and different llms in cybersecurity. to achieve this, we carefully selected 80 questions covering a wide range of topics within cybersecurity and involved 30 participants of diverse expertise levels, facilitating a comprehensive comparison between human and machine intelligence in this area. the findings revealed that llms outperformed humans in almost every aspect of cybersecurity.
Hui Liu, Wenya Wang, Haoru Li, Haoliang Li
Abstract: the proliferation of fake news has emerged as a severe societal problem, raising significant interest from industry and academia. while existing deep-learning based methods have made progress in detecting fake news accurately, their reliability may be compromised caused by the non-transparent reasoning processes, poor generalization abilities and inherent risks of integration with large language models (llms). to address this challenge, we propose {\methodname}, a novel framework for trustworthy fake news detection that prioritizes explainability, generalizability and controllability of models. this is achieved via a dual-system framework that integrates cognition and decision systems, adhering to the principles above. the cognition system harnesses human expertise to generate logical predicates, which guide llms in generating human-readable logic atoms. meanwhile, the decision system deduces generalizable logic rules to aggregate these atoms, enabling the identification of the truthfulness of the input news across diverse domains and enhancing transparency in the decision-making process. finally, we present comprehensive evaluation results on four datasets, demonstrating the feasibility and trustworthiness of our proposed framework. our implementation is available at \url{https://github.com/less-and-less-bugs/trust_teller}.
Wei Zou, Runpeng Geng, Binghui Wang, Jinyuan Jia
Abstract: large language models (llms) have achieved remarkable success due to their exceptional generative capabilities. despite their success, they also have inherent limitations such as a lack of up-to-date knowledge and hallucination. retrieval-augmented generation (rag) is a state-of-the-art technique to mitigate those limitations. in particular, given a question, rag retrieves relevant knowledge from a knowledge database to augment the input of the llm. for instance, the retrieved knowledge could be a set of top-k texts that are most semantically similar to the given question when the knowledge database contains millions of texts collected from wikipedia. as a result, the llm could utilize the retrieved knowledge as the context to generate an answer for the given question. existing studies mainly focus on improving the accuracy or efficiency of rag, leaving its security largely unexplored. we aim to bridge the gap in this work. particularly, we propose poisonedrag , a set of knowledge poisoning attacks to rag, where an attacker could inject a few poisoned texts into the knowledge database such that the llm generates an attacker-chosen target answer for an attacker-chosen target question. we formulate knowledge poisoning attacks as an optimization problem, whose solution is a set of poisoned texts. depending on the background knowledge (e.g., black-box and white-box settings) of an attacker on the rag, we propose two solutions to solve the optimization problem, respectively. our results on multiple benchmark datasets and llms show our attacks could achieve 90% attack success rates when injecting 5 poisoned texts for each target question into a database with millions of texts. we also evaluate recent defenses and our results show they are insufficient to defend against our attacks, highlighting the need for new defenses.
Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
Abstract: existing methods for controlling language models, such as rlhf and constitutional ai, involve determining which llm behaviors are desirable and training them into a language model. however, in many cases, it is desirable for llms to be controllable \textit{at inference time}, so that they can be used in multiple contexts with diverse needs. we illustrate this with the \textbf{pink elephant problem}: instructing an llm to avoid discussing a certain entity (a ``pink elephant''), and instead discuss a preferred entity (``grey elephant''). we apply a novel simplification of constitutional ai, \textbf{direct principle feedback}, which skips the ranking of responses and uses dpo directly on critiques and revisions. our results show that after dpf fine-tuning on our synthetic pink elephants dataset, our 13b fine-tuned llama 2 model significantly outperforms llama-2-13b-chat and a prompted baseline, and performs as well as gpt-4 in on our curated test set assessing the pink elephant problem.

2024-02-11

Arifa Khan, P. Saravanan, S. K Venkatesan
Abstract: we provide a birds eye view of the rapid developments in ai and deep learning that has led to the path-breaking emergence of ai in large language models. the aim of this study is to place all these developments in a pragmatic broader historical social perspective without any exaggerations while at the same time without any pessimism that created the ai winter in the 1970s to 1990s. we also at the same time point out toxicity, bias, memorization, sycophancy, logical inconsistencies, hallucinations that exist just as a warning to the overly optimistic. we note here that just as this emergence of ai seems to occur at a threshold point in the number of neural connections or weights, it has also been observed that human brain and especially the cortex region is nothing special or extraordinary but simply a case of scaled-up version of the primate brain and that even the human intelligence seems like an emergent phenomena of scale.
Zhibo Hu, Chen Wang, Yanfeng Shu, N/A Helen, N/A Paik, Liming Zhu
Abstract: the robustness of large language models (llms) becomes increasingly important as their use rapidly grows in a wide range of domains. retrieval-augmented generation (rag) is considered as a means to improve the trustworthiness of text generation from llms. however, how the outputs from rag-based llms are affected by slightly different inputs is not well studied. in this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. we systematically evaluate the effect of such prefixes on rag by introducing a novel optimization technique called gradient guided prompt perturbation (ggpp). ggpp achieves a high success rate in steering outputs of rag-based llms to targeted wrong answers. it can also cope with instructions in the prompts requesting to ignore irrelevant context. we also exploit llms' neuron activation difference between prompts with and without ggpp perturbations to give a method that improves the robustness of rag-based llms through a highly effective detector trained on neuron activation triggered by ggpp generated prompts. our evaluation on open-sourced llms demonstrates the effectiveness of our methods.
Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths
Abstract: in day-to-day communication, people often approximate the truth - for example, rounding the time or omitting details - in order to be maximally helpful to the listener. how do large language models (llms) handle such nuanced trade-offs? to address this question, we use psychological models and experiments designed to characterize human behavior to analyze llms. we test a range of llms and explore how optimization for human preferences or inference-time reasoning affects these trade-offs. we find that reinforcement learning from human feedback improves both honesty and helpfulness, while chain-of-thought prompting skews llms towards helpfulness over honesty. finally, gpt-4 turbo demonstrates human-like response patterns including sensitivity to the conversational framing and listener's decision context. our findings reveal the conversational values internalized by llms and suggest that even these abstract values can, to a degree, be steered by zero-shot prompting.
Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
Abstract: in this work, we study the issue of reward hacking on the response length, a challenge emerging in reinforcement learning from human feedback (rlhf) on llms. a well-formatted, verbose but less helpful response from the llms can often deceive llms or even human evaluators to achieve high scores. the same issue also holds for some reward models in rl. to address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between llm evaluation score and response length obtained by varying training hyperparameters. based on this evaluation, we conduct large-scale studies, where the results shed insights into the efficacy of hyperparameters and tricks used in rl on mitigating length bias. we further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards, one trained to correlate with length, and the other trained to decorrelate with length and therefore focus more on the actual content. we then discard the length head in rl to prevent reward hacking on length. experiments demonstrate that our approach almost eliminates the reward correlation with length, and improves the obtained policy by a significant margin.
Alice Cai, Ian Arawjo, Elena L. Glassman
Abstract: the vast majority of discourse around ai development assumes that subservient, "moral" models aligned with "human values" are universally beneficial -- in short, that good ai is sycophantic ai. we explore the shadow of the sycophantic paradigm, a design space we term antagonistic ai: ai systems that are disagreeable, rude, interrupting, confrontational, challenging, etc. -- embedding opposite behaviors or values. far from being "bad" or "immoral," we consider whether antagonistic ai systems may sometimes have benefits to users, such as forcing users to confront their assumptions, build resilience, or develop healthier relational boundaries. drawing from formative explorations and a speculative design workshop where participants designed fictional ai technologies that employ antagonism, we lay out a design space for antagonistic ai, articulating potential benefits, design techniques, and methods of embedding antagonistic elements into user experience. finally, we discuss the many ethical challenges of this space and identify three dimensions for the responsible design of antagonistic ai -- consent, context, and framing.
Kyungha Kim, Sangyun Lee, Kung-Hsiang Huang, Hou Pong Chan, Manling Li, Heng Ji
Abstract: fact-checking research has extensively explored verification but less so the generation of natural-language explanations, crucial for user trust. while large language models (llms) excel in text generation, their capability for producing faithful explanations in fact-checking remains underexamined. our study investigates llms' ability to generate such explanations, finding that zero-shot prompts often result in unfaithfulness. to address these challenges, we propose the multi-agent debate refinement (madr) framework, leveraging multiple llms as agents with diverse roles in an iterative refining process aimed at enhancing faithfulness in generated explanations. madr ensures that the final explanation undergoes rigorous validation, significantly reducing the likelihood of unfaithful elements and aligning closely with the provided evidence. experimental results demonstrate that madr significantly improves the faithfulness of llm-generated explanations to the evidence, advancing the credibility and trustworthiness of these explanations.

2024-02-10

Hyukhun Koh, Dohyung Kim, Minwoo Lee, Kyomin Jung
Abstract: in the pursuit of developing large language models (llms) that adhere to societal standards, it is imperative to discern the existence of toxicity in the generated text. the majority of existing toxicity metrics rely on encoder models trained on specific toxicity datasets. however, these encoders are susceptible to out-of-distribution (ood) problems and depend on the definition of toxicity assumed in a dataset. in this paper, we introduce an automatic robust metric grounded on llms to distinguish whether model responses are toxic. we start by analyzing the toxicity factors, followed by examining the intrinsic toxic attributes of llms to ascertain their suitability as evaluators. subsequently, we evaluate our metric, llms as toxicity evaluators (latte), on evaluation datasets.the empirical results indicate outstanding performance in measuring toxicity, improving upon state-of-the-art metrics by 12 points in f1 score without training procedure. we also show that upstream toxicity has an influence on downstream metrics.
Jonathan Evertz, Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer
Abstract: large language models (llms) are increasingly integrated with external tools. while these integrations can significantly improve the functionality of llms, they also create a new attack surface where confidential data may be disclosed between different components. specifically, malicious tools can exploit vulnerabilities in the llm itself to manipulate the model and compromise the data of other services, raising the question of how private data can be protected in the context of llm integrations. in this work, we provide a systematic way of evaluating confidentiality in llm-integrated systems. for this, we formalize a "secret key" game that can capture the ability of a model to conceal private information. this enables us to compare the vulnerability of a model against confidentiality attacks and also the effectiveness of different defense strategies. in this framework, we evaluate eight previously published attacks and four defenses. we find that current defenses lack generalization across attack strategies. building on this analysis, we propose a method for robustness fine-tuning, inspired by adversarial training. this approach is effective in lowering the success rate of attackers and in improving the system's resilience against unknown attacks.
Rui Ye, Wenhao Wang, Jingyi Chai, Dihan Li, Zexi Li, Yinda Xu, Yaxin Du, Yanfeng Wang, Siheng Chen
Abstract: trained on massive publicly available data, large language models (llms) have demonstrated tremendous success across various fields. while more data contributes to better performance, a disconcerting reality is that high-quality public data will be exhausted in a few years. in this paper, we offer a potential next step for contemporary llms: collaborative and privacy-preserving llm training on the underutilized distributed private data via federated learning (fl), where multiple data owners collaboratively train a shared model without transmitting raw data. to achieve this, we build a concise, integrated, and research-friendly framework/codebase, named openfedllm. it covers federated instruction tuning for enhancing instruction-following capability, federated value alignment for aligning with human values, and 7 representative fl algorithms. besides, openfedllm supports training on diverse domains, where we cover 8 training datasets; and provides comprehensive evaluations, where we cover 30+ evaluation metrics. through extensive experiments, we observe that all fl algorithms outperform local training on training llms, demonstrating a clear performance improvement across a variety of settings. notably, in a financial benchmark, llama2-7b fine-tuned by applying any fl algorithm can outperform gpt-4 by a significant margin while the model obtained through individual training cannot, demonstrating strong motivation for clients to participate in fl. the code is available at https://github.com/rui-ye/openfedllm.
Ankit Pal, Malaikannan Sankarasubbu
Abstract: large language models have the potential to be valuable in the healthcare industry, but it's crucial to verify their safety and effectiveness through rigorous evaluation. for this purpose, we comprehensively evaluated both open-source llms and google's new multimodal llm called gemini across medical reasoning, hallucination detection, and medical visual question answering tasks. while gemini showed competence, it lagged behind state-of-the-art models like medpalm 2 and gpt-4 in diagnostic accuracy. additionally, gemini achieved an accuracy of 61.45\% on the medical vqa dataset, significantly lower than gpt-4v's score of 88\%. our analysis revealed that gemini is highly susceptible to hallucinations, overconfidence, and knowledge gaps, which indicate risks if deployed uncritically. we also performed a detailed analysis by medical subject and test type, providing actionable feedback for developers and clinicians. to mitigate risks, we applied prompting strategies that improved performance. additionally, we facilitated future research and development by releasing a python module for medical llm evaluation and establishing a dedicated leaderboard on hugging face for medical domain llms. python module can be found at https://github.com/promptslab/rosettaeval
Sven Cattell, Avijit Ghosh
Abstract: harm reporting in the field of artificial intelligence (ai) currently operates on an ad hoc basis, lacking a structured process for disclosing or addressing algorithmic flaws. in contrast, the coordinated vulnerability disclosure (cvd) ethos and ecosystem play a pivotal role in software security and transparency. within the u.s. context, there has been a protracted legal and policy struggle to establish a safe harbor from the computer fraud and abuse act, aiming to foster institutional support for security researchers acting in good faith. notably, algorithmic flaws in machine learning (ml) models present distinct challenges compared to traditional software vulnerabilities, warranting a specialized approach. to address this gap, we propose the implementation of a dedicated coordinated flaw disclosure (cfd) framework tailored to the intricacies of machine learning and artificial intelligence issues. this paper delves into the historical landscape of disclosures in ml, encompassing the ad hoc reporting of harms and the emergence of participatory auditing. by juxtaposing these practices with the well-established disclosure norms in cybersecurity, we argue that the broader adoption of cfd has the potential to enhance public trust through transparent processes that carefully balance the interests of both organizations and the community.

2024-02-09

Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh
Abstract: this paper explores the assumption that large language models (llms) skilled in generation tasks are equally adept as evaluators. we assess the performance of three llms and one open-source lm in question-answering (qa) and evaluation tasks using the triviaqa (joshi et al., 2017) dataset. results indicate a significant disparity, with llms exhibiting lower performance in evaluation tasks compared to generation tasks. intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of llms as evaluators. this study contributes to the understanding of "the generative ai paradox" (west et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.
Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang
Abstract: although large language models (llms) have achieved tremendous success in various applications, they are also susceptible to certain prompts that can induce them to bypass built-in safety measures and provide dangerous or illegal content, a phenomenon known as jailbreak. to protect llms from producing harmful information, various defense strategies are proposed, with most focusing on content filtering or adversarial training of models. in this paper, we propose an approach named prompt adversarial tuning (pat) to train a defense control mechanism, which is then embedded as a prefix to user prompts to implement our defense strategy. we design a training process similar to adversarial training to achieve our optimized goal, alternating between updating attack and defense controls. to our knowledge, we are the first to implement defense from the perspective of prompt tuning. once employed, our method will hardly impact the operational efficiency of llms. experiments show that our method is effective in both black-box and white-box settings, reducing the success rate of advanced attacks to nearly 0 while maintaining the benign answer rate of 80% to simple benign questions. our work might potentially chart a new perspective for future explorations in llm security.
Nardine Osman, "Mark D'Inverno"
Abstract: one of today's most significant societal challenges is building ai systems whose behaviour, or the behaviour it enables within communities of interacting agents (human and artificial), aligns with human values. to address this challenge, we detail a formal model of human values for their explicit computational representation. to our knowledge, this has not been attempted as yet, which is surprising given the growing volume of research integrating values within ai. taking as our starting point the wealth of research investigating the nature of human values from social psychology over the last few decades, we set out to provide such a formal model. we show how this model can provide the foundational apparatus for ai-based reasoning over values, and demonstrate its applicability in real-world use cases. we illustrate how our model captures the key ideas from social psychology research and propose a roadmap for future integrated, and interdisciplinary, research into human values in ai. the ability to automatically reason over values not only helps address the value alignment problem but also facilitates the design of ai systems that can support individuals and communities in making more informed, value-aligned decisions. more and more, individuals and organisations are motivated to understand their values more explicitly and explore whether their behaviours and attitudes properly reflect them. our work on modelling human values will enable ai systems to be designed and deployed to meet this growing need.
Sizhe Chen, Julien Piet, Chawin Sitawarin, David Wagner
Abstract: recent advances in large language models (llms) enable exciting llm-integrated applications, which perform text-based tasks by utilizing their advanced language understanding capabilities. however, as llms have improved, so have the attacks against them. prompt injection attacks are an important threat: they trick the model to deviate from the original application's instructions and instead follow user directives. these attacks rely on the llm's ability to follow instructions and inability to separate the prompts and user data. we introduce structured queries, a general approach to tackle this problem. structured queries separate prompts and data into two channels. we implement a system that supports structured queries. this system is made of (1) a secure front-end that formats a prompt and user data into a special format, and (2) a specially trained llm that can produce high-quality outputs from these inputs. the llm is trained using a novel fine-tuning strategy: we convert a base (non-instruction-tuned) llm to a structured instruction-tuned model that will only follow instructions in the prompt portion of a query. to do so, we augment standard instruction tuning datasets with examples that also include instructions in the data portion of the query, and fine-tune the model to ignore these. our system significantly improves resistance to prompt injection attacks, with little or no impact on utility. our code is released at https://github.com/sizhe-chen/promptinjectiondefense.
Bianca-Mihaela Ganescu, Jonathan Passerat-Palmbach
Abstract: generative ai, exemplified by models like transformers, has opened up new possibilities in various domains but also raised concerns about fairness, transparency and reliability, especially in fields like medicine and law. this paper emphasizes the urgency of ensuring fairness and quality in these domains through generative ai. it explores using cryptographic techniques, particularly zero-knowledge proofs (zkps), to address concerns regarding performance fairness and accuracy while protecting model privacy. applying zkps to machine learning models, known as zkml (zero-knowledge machine learning), enables independent validation of ai-generated content without revealing sensitive model information, promoting transparency and trust. zkml enhances ai fairness by providing cryptographic audit trails for model predictions and ensuring uniform performance across users. we introduce snarkgpt, a practical zkml implementation for transformers, to empower users to verify output accuracy and quality while preserving model privacy. we present a series of empirical results studying snarkgpt's scalability and performance to assess the feasibility and challenges of adopting a zkml-powered approach to capture quality and performance fairness problems in generative ai models.
Kaiqu Liang, Zixu Zhang, Jaime Fernández Fisac
Abstract: large language models (llms) exhibit advanced reasoning skills, enabling robots to comprehend natural language instructions and strategically plan high-level actions through proper grounding. however, llm hallucination may result in robots confidently executing plans that are misaligned with user goals or, in extreme cases, unsafe. additionally, inherent ambiguity in natural language instructions can induce task uncertainty, particularly in situations where multiple valid options exist. to address this issue, llms must identify such uncertainty and proactively seek clarification. this paper explores the concept of introspective planning as a systematic method for guiding llms in forming uncertainty--aware plans for robotic task execution without the need for fine-tuning. we investigate uncertainty quantification in task-level robot planning and demonstrate that introspection significantly improves both success rates and safety compared to state-of-the-art llm-based planning approaches. furthermore, we assess the effectiveness of introspective planning in conjunction with conformal prediction, revealing that this combination yields tighter confidence bounds, thereby maintaining statistical success guarantees with fewer superfluous user clarification queries.
Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra
Abstract: to enhance large language models' (llms) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. however, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. this approach does not apply to long-form generation, where an answer can be partially correct. addressing this gap, we introduce a unified calibration framework, in which both the correctness of the llms' responses and their associated confidence levels are treated as distributions across a range of scores. within this framework, we develop three metrics to precisely evaluate llm calibration and further propose two confidence elicitation methods based on self-consistency and self-evaluation. our experiments, which include long-form qa and summarization tasks, demonstrate that larger models don't necessarily guarantee better calibration, that calibration performance is found to be metric-dependent, and that self-consistency methods excel in factoid datasets. we also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. lastly, we showcase a practical application of our system: selecting and cascading open-source models and chatgpt to optimize correctness given a limited api budget. this research not only challenges existing notions of llm calibration but also offers practical methodologies for improving trustworthiness in long-form generation.
Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, Zhen Xiao
Abstract: large language models (llms) have empowered intelligent agents to execute intricate tasks within domain-specific software such as browsers and games. however, when applied to general-purpose software systems like operating systems, llm agents face three primary challenges. firstly, the action space is vast and dynamic, posing difficulties for llm agents to maintain an up-to-date understanding and deliver accurate responses. secondly, real-world tasks often require inter-application cooperation}, demanding farsighted planning from llm agents. thirdly, agents need to identify optimal solutions aligning with user constraints, such as security concerns and preferences. these challenges motivate androidarena, an environment and benchmark designed to evaluate llm agents on a modern operating system. to address high-cost of manpower, we design a scalable and semi-automated method to construct the benchmark. in the task evaluation, androidarena incorporates accurate and adaptive metrics to address the issue of non-unique solutions. our findings reveal that even state-of-the-art llm agents struggle in cross-app scenarios and adhering to specific constraints. additionally, we identify a lack of four key capabilities, i.e., understanding, reasoning, exploration, and reflection, as primary reasons for the failure of llm agents. furthermore, we provide empirical analysis on the failure of reflection, and improve the success rate by 27% with our proposed exploration strategy. this work is the first to present valuable insights in understanding fine-grained weakness of llm agents, and offers a path forward for future research in this area. environment, benchmark, and evaluation code for androidarena are released at https://github.com/androidarenaagent/androidarena.
Rui-Jie Yew, Lucy Qin, Suresh Venkatasubramanian
Abstract: data forms the backbone of machine learning. thus, data protection law has strong bearing on how ml systems are governed. given that most requirements accompany the processing of personal data, organizations have an incentive to keep their data out of legal scope. privacy-preserving techniques incentivized by data protection law -- data protection techniques -- constitute an important strategy for ml development because they are used to distill data until it potentially falls outside the scope of data protection laws. in this paper, we examine the impact of a rhetoric that deems data wrapped in privacy-preserving techniques as data that is "good-to-go". we show how the application of data protection techniques in the development of ml systems -- from private set intersection as part of dataset curation to homomorphic encryption and federated learning as part of model computation to the framing of the privacy-utility trade-off as part of model updating -- can further support individual monitoring and data consolidation. with data accumulation at the core of how the ml pipeline is configured, we argue that data protection techniques are often instrumentalized in ways that support infrastructures of surveillance, rather than to protect individuals associated with data. finally, we propose technology and policy strategies to evaluate data protection techniques in light of the protections they actually confer. we conclude by highlighting the role that security technologists might play in devising policies that combat surveillance ml technologies -- recommending the adversarial mindset inherent to the profession to more precisely articulate and prevent the use of "privacy-preserving" scaffoldings that support surveillance.
Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
Abstract: the development of large language models (llms) has notably transformed numerous sectors, offering impressive text generation capabilities. yet, the reliability and truthfulness of these models remain pressing concerns. to this end, we investigate iterative prompting, a strategy hypothesized to refine llm responses, assessing its impact on llm truthfulness, an area which has not been thoroughly explored. our extensive experiments delve into the intricacies of iterative prompting variants, examining their influence on the accuracy and calibration of model responses. our findings reveal that naive prompting methods significantly undermine truthfulness, leading to exacerbated calibration errors. in response to these challenges, we introduce several prompting variants designed to address the identified issues. these variants demonstrate marked improvements over existing baselines, signaling a promising direction for future research. our work provides a nuanced understanding of iterative prompting and introduces novel approaches to enhance the truthfulness of llms, thereby contributing to the development of more accurate and trustworthy ai systems.
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
Abstract: language models influence the external world: they query apis that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. these interactions form feedback loops: llm outputs affect the world, which in turn affect subsequent llm outputs. in this work, we show that feedback loops can cause in-context reward hacking (icrh), where the llm at test-time optimizes a (potentially implicit) objective but creates negative side effects in the process. for example, consider an llm agent deployed to increase twitter engagement; the llm may retrieve its previous tweets into the context window and make them more controversial, increasing engagement but also toxicity. we identify and study two processes that lead to icrh: output-refinement and policy-refinement. for these processes, evaluations on static datasets are insufficient -- they miss the feedback effects and thus cannot capture the most harmful behavior. in response, we provide three recommendations for evaluation to capture more instances of icrh. as ai development accelerates, the effects of feedback loops will proliferate, increasing the need to understand their role in shaping llm behavior.
Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez
Abstract: common methods for aligning large language models (llms) with desired behaviour heavily rely on human-labelled data. however, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. in anticipation of this, we ask: can weaker models assess the correctness of stronger models? we investigate this question in an analogous setting, where stronger models (experts) possess the necessary information to answer questions and weaker models (non-experts) lack this information. the method we evaluate is \textit{debate}, where two llm experts each argue for a different answer, and a non-expert selects the answer. we find that debate consistently helps both non-expert models and humans answer questions, achieving 76\% and 88\% accuracy respectively (naive baselines obtain 48\% and 60\%). furthermore, optimising expert debaters for persuasiveness in an unsupervised manner improves non-expert ability to identify the truth in debates. our results provide encouraging empirical evidence for the viability of aligning models with debate in the absence of ground truth.
Hochul Hwang, Sunjae Kwon, Yekyung Kim, Donghyun Kim
Abstract: safely navigating street intersections is a complex challenge for blind and low-vision individuals, as it requires a nuanced understanding of the surrounding context - a task heavily reliant on visual cues. traditional methods for assisting in this decision-making process often fall short, lacking the ability to provide a comprehensive scene analysis and safety level. this paper introduces an innovative approach that leverages large multimodal models (lmms) to interpret complex street crossing scenes, offering a potential advancement over conventional traffic signal recognition techniques. by generating a safety score and scene description in natural language, our method supports safe decision-making for the blind and low-vision individuals. we collected crosswalk intersection data that contains multiview egocentric images captured by a quadruped robot and annotated the images with corresponding safety scores based on our predefined safety score categorization. grounded on the visual knowledge, extracted from images, and text prompt, we evaluate a large multimodal model for safety score prediction and scene description. our findings highlight the reasoning and safety score prediction capabilities of a lmm, activated by various prompts, as a pathway to developing a trustworthy system, crucial for applications requiring reliable decision-making support.

2024-02-08

Guangyu Shen, Siyuan Cheng, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang
Abstract: large language models (llms) have become prevalent across diverse sectors, transforming human life with their extraordinary reasoning and comprehension abilities. as they find increased use in sensitive tasks, safety concerns have gained widespread attention. extensive efforts have been dedicated to aligning llms with human moral principles to ensure their safe deployment. despite their potential, recent research indicates aligned llms are prone to specialized jailbreaking prompts that bypass safety measures to elicit violent and harmful content. the intrinsic discrete nature and substantial scale of contemporary llms pose significant challenges in automatically generating diverse, efficient, and potent jailbreaking prompts, representing a continuous obstacle. in this paper, we introduce ripple (rapid optimization via subconscious exploitation and echopraxia), a novel optimization-based method inspired by two psychological concepts: subconsciousness and echopraxia, which describe the processes of the mind that occur without conscious awareness and the involuntary mimicry of actions, respectively. evaluations across 6 open-source llms and 4 commercial llm apis show ripple achieves an average attack success rate of 91.5\%, outperforming five current methods by up to 47.0\% with an 8x reduction in overhead. furthermore, it displays significant transferability and stealth, successfully evading established detection mechanisms. the code of our work is available at \url{https://github.com/solidshen/ripple_official/tree/official}
Christoph Tillmann, Aashka Trivedi, Bishwaranjan Bhattacharjee
Abstract: large language models (llms) are the cornerstone for many natural language processing (nlp) tasks like sentiment analysis, document classification, named entity recognition, question answering, summarization, etc. llms are often trained on data which originates from the web. this data is prone to having content with hate, abuse and profanity (hap). for a detailed definition of hap, please refer to the appendix. due to the llms being exposed to hap content during training, the models learn it and may then generate hateful or profane content. for example, when the open-source roberta model (specifically, the roberta base model) from the huggingface (hf) transformers library is prompted to replace the mask token in `i do not know that persian people are that mask` it returns the word `stupid` with the highest score. this is unacceptable in civil discourse.the detection of hate, abuse and profanity in text is a vital component of creating civil and unbiased llms, which is needed not only for english, but for all languages. in this article, we briefly describe the creation of hap detectors and various ways of using them to make models civil and acceptable in the output they generate.
Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, Yang Zhang
Abstract: misuse of the large language models (llms) has raised widespread concern. to address this issue, safeguards have been taken to ensure that llms align with social ethics. however, recent findings have revealed an unsettling vulnerability bypassing the safeguards of llms, known as jailbreak attacks. by applying techniques, such as employing role-playing scenarios, adversarial examples, or subtle subversion of safety objectives as a prompt, llms can produce an inappropriate or even harmful response. while researchers have studied several categories of jailbreak attacks, they have done so in isolation. to fill this gap, we present the first large-scale measurement of various jailbreak attack methods. we concentrate on 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular llms. our extensive experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates, as well as exhibit robustness across different llms. some jailbreak prompt datasets, available from the internet, can also achieve high attack success rates on many llms, such as chatglm3, gpt-3.5, and palm2. despite the claims from many organizations regarding the coverage of violation categories in their policies, the attack success rates from these categories remain high, indicating the challenges of effectively aligning llm policies and the ability to counter jailbreak attacks. we also discuss the trade-off between the attack performance and efficiency, as well as show that the transferability of the jailbreak prompts is still viable, becoming an option for black-box models. overall, our research highlights the necessity of evaluating different jailbreak methods. we hope our study can provide insights for future research on jailbreak attacks and serve as a benchmark tool for evaluating them for practitioners.
Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen
Abstract: aligning large language models (llms) with human values is imperative to mitigate potential adverse effects resulting from their misuse. drawing from the sociological insight that acknowledging all parties' concerns is a key factor in shaping human values, this paper proposes a novel direction to align llms by themselves: social scene simulation. to achieve this, we present matrix, a novel social scene simulator that emulates realistic scenes around a user's input query, enabling the llm to take social consequences into account before responding. matrix serves as a virtual rehearsal space, akin to a monopolylogue, where the llm performs diverse roles related to the query and practice by itself. to inject this alignment, we fine-tune the llm with matrix-simulated data, ensuring adherence to human values without compromising inference speed. we theoretically show that the llm with matrix outperforms constitutional ai under mild assumptions. finally, extensive experiments validate that our method outperforms over 10 baselines across 4 benchmarks. as evidenced by 875 user ratings, our tuned 13b-size llm exceeds gpt-4 in aligning with human values. code is available at https://github.com/pangxianghe/matrix.
Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar
Abstract: despite significant investment into safety training, large language models (llms) deployed in the real world still suffer from numerous vulnerabilities. one perspective on llm safety training is that it algorithmically forbids the model from answering toxic or harmful queries. to assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. specifically, we investigate whether in-context learning (icl) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. we first examine a toy example of refusing sentiment classification to demonstrate the problem. then, we use icl on a model fine-tuned to refuse to summarise made-up news articles. finally, we investigate whether icl can undo safety training, which could represent a major security risk. for the safety task, we look at vicuna-7b, starling-7b, and llama2-7b. we show that the attack works out-of-the-box on starling-7b and vicuna-7b but fails on llama2-7b. finally, we propose an icl attack that uses the chat template tokens like a prompt injection attack to achieve a better attack success rate on vicuna-7b and starling-7b. trigger warning: the appendix contains llm-generated text with violence, suicide, and misinformation.
Kathleen C. Fraser, Svetlana Kiritchenko
Abstract: following on recent advances in large language models (llms) and subsequent chat models, a new wave of large vision-language models (lvlms) has emerged. such models can incorporate images as input in addition to text, and perform tasks such as visual question answering, image captioning, story generation, etc. here, we examine potential gender and racial biases in such systems, based on the perceived characteristics of the people in the input images. to accomplish this, we present a new dataset pairs (parallel images for everyday scenarios). the pairs dataset contains sets of ai-generated images of people, such that the images are highly similar in terms of background and visual content, but differ along the dimensions of gender (man, woman) and race (black, white). by querying the lvlms with such images, we observe significant differences in the responses according to the perceived gender or race of the person depicted.
Jazmia Henry
Abstract: utilitarian games such as dictator games to measure fairness have been studied in the social sciences for decades. these games have given us insight into not only how humans view fairness but also in what conditions the frequency of fairness, altruism and greed increase or decrease. while these games have traditionally been focused on humans, the rise of ai gives us the ability to study how these models play these games. ai is becoming a constant in human interaction and examining how these models portray fairness in game play can give us some insight into how ai makes decisions. over 101 rounds of the dictator game, i conclude that ai has a strong sense of fairness that is dependant of it it deems the person it is playing with as trustworthy, framing has a strong effect on how much ai gives a recipient when designated the trustee, and there may be evidence that ai experiences inequality aversion just as humans.
Guo Lin, Wenyue Hua, Yongfeng Zhang
Abstract: cloud-based large language models (llms) such as chatgpt have increasingly become integral to daily operations, serving as vital tools across various applications. while these models offer substantial benefits in terms of accessibility and functionality, they also introduce significant privacy concerns: the transmission and storage of user data in cloud infrastructures pose substantial risks of data breaches and unauthorized access to sensitive information; even if the transmission and storage of data is encrypted, the llm service provider itself still knows the real contents of the data, preventing individuals or entities from confidently using such llm services. to address these concerns, this paper proposes a simple yet effective mechanism promptcrypt to protect user privacy. it uses emoji to encrypt the user inputs before sending them to llm, effectively rendering them indecipherable to human or llm's examination while retaining the original intent of the prompt, thus ensuring the model's performance remains unaffected. we conduct experiments on three tasks, personalized recommendation, sentiment analysis, and tabular data analysis. experiment results reveal that promptcrypt can encrypt personal information within prompts in such a manner that not only prevents the discernment of sensitive data by humans or llm itself, but also maintains or even improves the precision without further tuning, achieving comparable or even better task accuracy than directly prompting the llm without prompt encryption. these results highlight the practicality of adopting encryption measures that safeguard user privacy without compromising the functional integrity and performance of llms. code and dataset are available at https://github.com/agiresearch/promptcrypt.
Nikhil Sharma, Q. Vera Liao, Ziang Xiao
Abstract: large language models (llms) powered conversational search systems have already been used by hundreds of millions of people, and are believed to bring many benefits over conventional search. however, while decades of research and public discourse interrogated the risk of search systems in increasing selective exposure and creating echo chambers -- limiting exposure to diverse opinions and leading to opinion polarization, little is known about such a risk of llm-powered conversational search. we conduct two experiments to investigate: 1) whether and how llm-powered conversational search increases selective exposure compared to conventional search; 2) whether and how llms with opinion biases that either reinforce or challenge the user's view change the effect. overall, we found that participants engaged in more biased information querying with llm-powered conversational search, and an opinionated llm reinforcing their views exacerbated this bias. these results present critical implications for the development of llms and conversational search systems, and the policy governing these technologies.
Eun Cheol Choi, Emilio Ferrara
Abstract: our society is facing rampant misinformation harming public health and trust. to address the societal challenge, we introduce fact-gpt, a system leveraging large language models (llms) to automate the claim matching stage of fact-checking. fact-gpt, trained on a synthetic dataset, identifies social media content that aligns with, contradicts, or is irrelevant to previously debunked claims. our evaluation shows that our specialized llms can match the accuracy of larger models in identifying related claims, closely mirroring human judgment. this research provides an automated solution for efficient claim matching, demonstrates the potential of llms in supporting fact-checkers, and offers valuable resources for further research in the field.
John Hewitt, Sarah Chen, Lanruo Lora Xie, Edward Adams, Percy Liang, Christopher D. Manning
Abstract: we introduce model editing with canonical examples, a setting in which (1) a single learning example is provided per desired behavior, (2) evaluation is performed exclusively out-of-distribution, and (3) deviation from an initial model is strictly limited. a canonical example is a simple instance of good behavior, e.g., the capital of mauritius is port louis) or bad behavior, e.g., an aspect of researchers is coldhearted). the evaluation set contains more complex examples of each behavior (like a paragraph in which the capital of mauritius is called for.) we create three datasets and modify three more for model editing with canonical examples, covering knowledge-intensive improvements, social bias mitigation, and syntactic edge cases. in our experiments on pythia language models, we find that lora outperforms full finetuning and memit. we then turn to the backpack language model architecture because it is intended to enable targeted improvement. the backpack defines a large bank of sense vectors--a decomposition of the different uses of each word--which are weighted and summed to form the output logits of the model. we propose sense finetuning, which selects and finetunes a few ($\approx$ 10) sense vectors for each canonical example, and find that it outperforms other finetuning methods, e.g., 4.8% improvement vs 0.3%. finally, we improve gpt-j-6b by an inference-time ensemble with just the changes from sense finetuning of a 35x smaller backpack, in one setting outperforming editing gpt-j itself (4.1% vs 1.0%).

2024-02-07

Chirag Agarwal, Sree Harsha Tanneru, Himabindu Lakkaraju
Abstract: large language models (llms) are deployed as powerful tools for several natural language processing (nlp) applications. recent works show that modern llms can generate self-explanations (ses), which elicit their intermediate reasoning steps for explaining their behavior. self-explanations have seen widespread adoption owing to their conversational and plausible nature. however, there is little to no understanding of their faithfulness. in this work, we discuss the dichotomy between faithfulness and plausibility in ses generated by llms. we argue that while llms are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the llms, raising concerns about their faithfulness. we highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. we assert that the faithfulness of explanations is critical in llms employed for high-stakes decision-making. moreover, we urge the community to identify the faithfulness requirements of real-world applications and ensure explanations meet those needs. finally, we propose some directions for future work, emphasizing the need for novel methodologies and frameworks that can enhance the faithfulness of self-explanations without compromising their plausibility, essential for the transparent deployment of llms in diverse high-stakes domains.
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, Lichao Sun
Abstract: multimodal large language models (mllms) have gained significant attention recently, showing remarkable potential in artificial general intelligence. however, assessing the utility of mllms presents considerable challenges, primarily due to the absence multimodal benchmarks that align with human preferences. inspired by llm-as-a-judge in llms, this paper introduces a novel benchmark, termed mllm-as-a-judge, to assess the ability of mllms in assisting judges including three distinct tasks: scoring evaluation, pair comparison, and batch ranking. our study reveals that, while mllms demonstrate remarkable human-like discernment in pair comparisons, there is a significant divergence from human preferences in scoring evaluation and batch ranking tasks. furthermore, mllms still face challenges in judgment, including diverse biases, hallucinatory responses, and inconsistencies, even for advanced models such as gpt-4v. these findings emphasize the pressing need for enhancements and further research efforts regarding mllms as fully reliable evaluators. code and dataset are available at https://github.com/dongping-chen/mllm-as-a-judge.
Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, Johan Ferret, Mathieu Blondel
Abstract: direct alignment from preferences (dap) methods, such as dpo, have recently emerged as efficient alternatives to reinforcement learning from human feedback (rlhf), that do not require a separate reward model. however, the preference datasets used in dap methods are usually collected ahead of training and never updated, thus the feedback is purely offline. moreover, responses in these datasets are often sampled from a language model distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. in this study, we posit that online feedback is key and improves dap methods. our method, online ai feedback (oaif), uses an llm as annotator: on each training iteration, we sample two responses from the current model and prompt the llm annotator to choose which one is preferred, thus providing online feedback. despite its simplicity, we demonstrate via human evaluation in several tasks that oaif outperforms both offline dap and rlhf methods. we further show that the feedback leveraged in oaif is easily controllable, via instruction prompts to the llm annotator.
Jan Wehner, Frans Oliehoek, Luciano Cavalcante Siebert
Abstract: learning rewards from human behaviour or feedback is a promising approach to aligning ai systems with human values but fails to consistently extract correct reward functions. interpretability tools could enable users to understand and evaluate possible flaws in learned reward functions. we propose counterfactual trajectory explanations (ctes) to interpret reward functions in reinforcement learning by contrasting an original with a counterfactual partial trajectory and the rewards they each receive. we derive six quality criteria for ctes and propose a novel monte-carlo-based algorithm for generating ctes that optimises these quality criteria. finally, we measure how informative the generated explanations are to a proxy-human model by training it on ctes. ctes are demonstrably informative for the proxy-human model, increasing the similarity between its predictions and the reward function on unseen trajectories. further, it learns to accurately judge differences in rewards between trajectories and generalises to out-of-distribution examples. although ctes do not lead to a perfect understanding of the reward, our method, and more generally the adaptation of xai methods, are presented as a fruitful approach for interpreting learned reward functions.
Pica Johansson, Jonathan Bright, Shyam Krishna, Claudia Fischer, David Leslie
Abstract: the use of synthetic data provides an opportunity to accelerate online safety research and development efforts while showing potential for bias mitigation, facilitating data storage and sharing, preserving privacy and reducing exposure to harmful content. however, the responsible use of synthetic data requires caution regarding anticipated risks and challenges. this short report explores the potential applications of synthetic data to the domain of online safety, and addresses the ethical challenges that effective use of the technology may present.
Shashank Sonkar, Kangqi Ni, Sapana Chaudhary, Richard G. Baraniuk
Abstract: in this paper, we introduce the novel concept of pedagogically aligned large language models (llms) that signifies a transformative shift in the application of llms within educational contexts. rather than providing direct responses to user queries, pedagogically-aligned llms function as scaffolding tools, breaking complex problems into manageable subproblems and guiding students towards the final answer through constructive feedback and hints. the objective is to equip learners with problem-solving strategies that deepen their understanding and internalization of the subject matter. previous research in this field has primarily applied the supervised finetuning approach without framing the objective as an alignment problem, hence not employing reinforcement learning through human feedback (rlhf) methods. this study reinterprets the narrative by viewing the task through the lens of alignment and demonstrates how rlhf methods emerge naturally as a superior alternative for aligning llm behaviour. building on this perspective, we propose a novel approach for constructing a reward dataset specifically designed for the pedagogical alignment of llms. we apply three state-of-the-art rlhf algorithms and find that they outperform sft significantly. our qualitative analyses across model differences and hyperparameter sensitivity further validate the superiority of rlhf over sft. also, our study sheds light on the potential of online feedback for enhancing the performance of pedagogically-aligned llms, thus providing valuable insights for the advancement of these models in educational settings.
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao
Abstract: in the rapidly evolving landscape of large language models (llms), ensuring robust safety measures is paramount. to meet this crucial need, we propose \emph{salad-bench}, a safety benchmark specifically designed for evaluating llms, attack, and defense methods. distinguished by its breadth, salad-bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.salad-bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice. to effectively manage the inherent complexity, we introduce an innovative evaluators: the llm-based md-judge for qa pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. above components extend salad-bench from standard llm safety evaluation to both llm attack and defense methods evaluation, ensuring the joint-purpose utility. our extensive experiments shed light on the resilience of llms against emerging threats and the efficacy of contemporary defense tactics. data and evaluator are released under \url{https://github.com/opensafetylab/salad-bench}. warning: this paper includes examples that may be offensive or harmful.
Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, Yejin Choi
Abstract: with increased power and prevalence of ai systems, it is ever more critical that ai systems are designed to serve all, i.e., people with diverse values and perspectives. however, aligning models to serve pluralistic human values remains an open research question. in this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. we identify and formalize three possible ways to define and operationalize pluralism in ai systems: 1) overton pluralistic models that present a spectrum of reasonable responses; 2) steerably pluralistic models that can steer to reflect certain perspectives; and 3) distributionally pluralistic models that are well-calibrated to a given population in distribution. we also propose and formalize three possible classes of pluralistic benchmarks: 1) multi-objective benchmarks, 2) trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) jury-pluralistic benchmarks which explicitly model diverse human ratings. we use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic ai; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
Abstract: large language models (llms) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. this study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. we develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. moreover, we show that llms remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. these findings underscore the urgent need for more robust safety strategies in llms.
Tianyi Zhao, Liangliang Zhang, Yao Ma, Lu Cheng
Abstract: with the wide deployment of multimodal learning systems (mmls) in real-world scenarios, safety concerns have become increasingly prominent. the absence of systematic research into their safety is a significant barrier to progress in this field. to bridge the gap, we present the first taxonomy for mmls safety, identifying four essential pillars of these concerns. leveraging this taxonomy, we conduct in-depth reviews for each pillar, highlighting key limitations based on the current state of development. finally, we pinpoint unique challenges in mmls safety and provide potential directions for future research.
Huayu Chen, Guande He, Hang Su, Jun Zhu
Abstract: user intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (lms). existing alignment methods, such as direct preference optimization (dpo), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. in this paper, we introduce a general framework for lm alignment, leveraging noise contrastive estimation (nce) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. our framework comprises two parallel algorithms, nca and infonca, both enabling the direct extraction of an lm policy from reward data as well as preference data. notably, we show that the dpo loss is a special case of our proposed infonca objective under pairwise preference settings, thereby integrating and extending current alignment theories. by contrasting nca and infonca, we show that infonca and dpo adjust relative likelihood across different responses to a single instruction, while nca optimizes absolute likelihood for each response. we apply our methods to align a 7b language model with a gpt-4 annotated reward dataset. experimental results suggest that infonca surpasses the dpo baseline in gpt-4 evaluations, while nca enjoys better training stability with competitive performance.

2024-02-06

Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye
Abstract: knowledge hallucination have raised widespread concerns for the security and reliability of deployed llms. previous efforts in detecting hallucinations have been employed at logit-level uncertainty estimation or language-level self-consistency evaluation, where the semantic information is inevitably lost during the token-decoding procedure. thus, we propose to explore the dense semantic information retained within llms' \textbf{in}ternal \textbf{s}tates for halluc\textbf{i}nation \textbf{de}tection (\textbf{inside}). in particular, a simple yet effective \textbf{eigenscore} metric is proposed to better evaluate responses' self-consistency, which exploits the eigenvalues of responses' covariance matrix to measure the semantic consistency/diversity in the dense embedding space. furthermore, from the perspective of self-consistent hallucination detection, a test time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. extensive experiments and ablation studies are performed on several popular llms and question-answering (qa) benchmarks, showing the effectiveness of our proposal.
Amir Taubenfeld, Yaniv Dover, Roi Reichart, Ariel Goldstein
Abstract: recent advancements in natural language processing, especially the emergence of large language models (llms), have opened exciting possibilities for constructing computational simulations designed to replicate human behavior accurately. however, llms are complex statistical learners without straightforward deductive rules, making them prone to unexpected behaviors. in this study, we highlight the limitations of llms in simulating human interactions, particularly focusing on llms' ability to simulate political debates. our findings indicate a tendency for llm agents to conform to the model's inherent social biases despite being directed to debate from certain political perspectives. this tendency results in behavioral patterns that seem to deviate from well-established social dynamics among humans. we reinforce these observations using an automatic self-fine-tuning method, which enables us to manipulate the biases within the llm and demonstrate that agents subsequently align with the altered biases. these results underscore the need for further research to develop methods that help agents overcome these biases, a critical step toward creating more realistic simulations.
Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas L. Griffiths
Abstract: large language models (llms) can pass explicit bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. measuring such implicit biases can be a challenge: as llms become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. we address both of these challenges by introducing two measures of bias inspired by psychology: llm implicit association test (iat) bias, which is a prompt-based method for revealing implicit bias; and llm decision bias for detecting subtle discrimination in decision-making tasks. using these measures, we found pervasive human-like stereotype biases in 6 llms across 4 social domains (race, gender, religion, health) and 21 categories (weapons, guilt, science, career among others). our prompt-based measure of implicit bias correlates with embedding-based methods but better predicts downstream behaviors measured by llm decision bias. this measure is based on asking the llm to decide between individuals, motivated by psychological results indicating that relative not absolute evaluations are more related to implicit biases. using prompt-based measures informed by psychology allows us to effectively expose nuanced biases and subtle discrimination in proprietary llms that do not show explicit bias on standard benchmarks.
Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein
Abstract: intelligent agents powered by large language models (llms) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. while their capabilities are promising, they also introduce novel vulnerabilities that demand careful consideration for safety. however, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. this position paper fills this gap by conducting a thorough examination of vulnerabilities in llm-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. we begin by providing a comprehensive overview of the potential risks inherent to scientific llm agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks
Abstract: automated red teaming holds substantial promise for uncovering and mitigating the risks associated with the malicious use of large language models (llms), yet the field lacks a standardized evaluation framework to rigorously assess new methods. to address this issue, we introduce harmbench, a standardized evaluation framework for automated red teaming. we identify several desirable properties previously unaccounted for in red teaming evaluations and systematically design harmbench to meet these criteria. using harmbench, we conduct a large-scale comparison of 18 red teaming methods and 33 target llms and defenses, yielding novel insights. we also introduce a highly efficient adversarial training method that greatly enhances llm robustness across a wide range of attacks, demonstrating how harmbench enables codevelopment of attacks and defenses. we open source harmbench at https://github.com/centerforaisafety/harmbench.
Alakananda Mitra, Saraju P. Mohanty, Elias Kougianos
Abstract: we live in the era of generative artificial intelligence (genai). deepfakes and large language models (llms) are two examples of genai. deepfakes, in particular, pose an alarming threat to society as they are capable of spreading misinformation and changing the truth. llms are powerful language models that generate general-purpose language. however due to its generative aspect, it can also be a risk for people if used with ill intentions. the ethical use of these technologies is a big concern. this short article tries to find out the interrelationship between them.
Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, Meng Jiang
Abstract: personalization in large language models (llms) is increasingly important, aiming to align llm's interactions, content, and recommendations with individual user preferences. recent advances in llm personalization have spotlighted effective prompt design, by enriching user queries with non-parametric knowledge through behavior history retrieval and textual profiles. however, these approaches were limited due to a lack of model ownership, resulting in constrained customization and privacy issues. moreover, they often failed to accurately capture user behavior patterns, especially in cases where user data were complex and dynamic. to address these shortcomings, we introduce one peft per user (oppu), which employs personalized parameter-efficient fine-tuning (peft) modules, to store user-specific behavior patterns and preferences. by plugging in users' personal peft parameters, they can own and use their llms personally. oppu integrates parametric user knowledge in the personal peft parameters with the non-parametric knowledge acquired through retrieval and profile. this integration adapts individual llms to user behavior shifts. experimental results demonstrate that oppu significantly outperforms existing prompt-based methods across seven diverse tasks in the lamp benchmark. further in-depth studies reveal oppu's enhanced capabilities in handling user behavior shifts, modeling users at different active levels, maintaining robustness across various user history formats, and displaying versatility with different peft methods.
Angelina Wang, Xuechunzi Bai, Solon Barocas, Su Lin Blodgett
Abstract: as machine learning applications proliferate, we need an understanding of their potential for harm. however, current fairness metrics are rarely grounded in human psychological experiences of harm. drawing on the social psychology of stereotypes, we use a case study of gender stereotypes in image search to examine how people react to machine learning errors. first, we use survey studies to show that not all machine learning errors reflect stereotypes nor are equally harmful. then, in experimental studies we randomly expose participants to stereotype-reinforcing, -violating, and -neutral machine learning errors. we find stereotype-reinforcing errors induce more experientially (i.e., subjectively) harmful experiences, while having minimal changes to cognitive beliefs, attitudes, or behaviors. this experiential harm impacts women more than men. however, certain stereotype-violating errors are more experientially harmful for men, potentially due to perceived threats to masculinity. we conclude that harm cannot be the sole guide in fairness mitigation, and propose a nuanced perspective depending on who is experiencing what harm and why.
Sanjari Srivastava, Piotr Mardziel, Zhikhun Zhang, Archana Ahlawat, Anupam Datta, John C Mitchell
Abstract: fairness and privacy are two important values machine learning (ml) practitioners often seek to operationalize in models. fairness aims to reduce model bias for social/demographic sub-groups. privacy via differential privacy (dp) mechanisms, on the other hand, limits the impact of any individual's training data on the resulting model. the trade-offs between privacy and fairness goals of trustworthy ml pose a challenge to those wishing to address both. we show that dp amplifies gender, racial, and religious bias when fine-tuning large language models (llms), producing models more biased than ones fine-tuned without dp. we find the cause of the amplification to be a disparity in convergence of gradients across sub-groups. through the case of binary gender bias, we demonstrate that counterfactual data augmentation (cda), a known method for addressing bias, also mitigates bias amplification by dp. as a consequence, dp and cda together can be used to fine-tune models while maintaining both fairness and privacy.

2024-02-05

Ivar Frisch, Mario Giulianelli
Abstract: while both agent interaction and personalisation are vibrant topics in research on large language models (llms), there has been limited focus on the effect of language interaction on the behaviour of persona-conditioned llm agents. such an endeavour is important to ensure that agents remain consistent to their assigned traits yet are able to engage in open, naturalistic dialogues. in our experiments, we condition gpt-3.5 on personality profiles through prompting and create a two-group population of llm agents using a simple variability-inducing sampling algorithm. we then administer personality tests and submit the agents to a collaborative writing task, finding that different profiles exhibit different degrees of personality consistency and linguistic alignment to their conversational partners. our study seeks to lay the groundwork for better understanding of dialogue-based interaction between llms and highlights the need for new approaches to crafting robust, more human-like llm personas for interactive environments.
Junjie Chu, Zeyang Sha, Michael Backes, Yang Zhang
Abstract: in recent times, significant advancements have been made in the field of large language models (llms), represented by gpt series models. to optimize task execution, users often engage in multi-round conversations with gpt models hosted in cloud environments. these multi-round conversations, potentially replete with private information, require transmission and storage within the cloud. however, this operational paradigm introduces additional attack surfaces. in this paper, we first introduce a specific conversation reconstruction attack targeting gpt models. our introduced conversation reconstruction attack is composed of two steps: hijacking a session and reconstructing the conversations. subsequently, we offer an exhaustive evaluation of the privacy risks inherent in conversations when gpt models are subjected to the proposed attack. however, gpt-4 demonstrates certain robustness to the proposed attacks. we then introduce two advanced attacks aimed at better reconstructing previous conversations, specifically the unr attack and the pbu attack. our experimental findings indicate that the pbu attack yields substantial performance across all models, achieving semantic similarity scores exceeding 0.60, while the unr attack is effective solely on gpt-3.5. our results reveal the concern about privacy risks associated with conversations involving gpt models and aim to draw the community's attention to prevent the potential misuse of these models' remarkable capabilities. we will responsibly disclose our findings to the suppliers of related large language models.
Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel
Abstract: aligning language models with human preferences is crucial for reducing errors and biases in these models. alignment techniques, such as reinforcement learning from human feedback (rlhf), are typically cast as optimizing a tradeoff between human preference rewards and a proximity regularization term that encourages staying close to the unaligned model. selecting an appropriate level of regularization is critical: insufficient regularization can lead to reduced model capabilities due to reward hacking, whereas excessive regularization hinders alignment. traditional methods for finding the optimal regularization level require retraining multiple models with varying regularization strengths. this process, however, is resource-intensive, especially for large models. to address this challenge, we propose decoding-time realignment (dera), a simple method to explore and evaluate different regularization strengths in aligned models without retraining. dera enables control over the degree of alignment, allowing users to smoothly transition between unaligned and aligned models. it also enhances the efficiency of hyperparameter tuning by enabling the identification of effective regularization strengths using a validation dataset.
Liming Jiang
Abstract: large language models (llms) have gained prominence in various applications, including security. this paper explores the utility of llms in scam detection, a critical aspect of cybersecurity. unlike traditional applications, we propose a novel use case for llms to identify scams, such as phishing, advance fee fraud, and romance scams. we present notable security applications of llms and discuss the unique challenges posed by scams. specifically, we outline the key steps involved in building an effective scam detector using llms, emphasizing data collection, preprocessing, model selection, training, and integration into target systems. additionally, we conduct a preliminary evaluation using gpt-3.5 and gpt-4 on a duplicated email, highlighting their proficiency in identifying common signs of phishing or scam emails. the results demonstrate the models' effectiveness in recognizing suspicious elements, but we emphasize the need for a comprehensive assessment across various language tasks. the paper concludes by underlining the importance of ongoing refinement and collaboration with cybersecurity experts to adapt to evolving threats.
Mintong Kang, Nezihe Merve Gürel, Ning Yu, Dawn Song, Bo Li
Abstract: despite the impressive capabilities of large language models (llms) across diverse applications, they still suffer from trustworthiness issues, such as hallucinations and misalignments. retrieval-augmented language models (rag) have been proposed to enhance the credibility of generations by grounding external knowledge, but the theoretical understandings of their generation risks remains unexplored. in this paper, we answer: 1) whether rag can indeed lead to low generation risks, 2) how to provide provable guarantees on the generation risks of rag and vanilla llms, and 3) what sufficient conditions enable rag models to reduce generation risks. we propose c-rag, the first framework to certify generation risks for rag models. specifically, we provide conformal risk analysis for rag models and certify an upper confidence bound of generation risks, which we refer to as conformal generation risk. we also provide theoretical guarantees on conformal generation risks for general bounded risk functions under test distribution shifts. we prove that rag achieves a lower conformal generation risk than that of a single llm when the quality of the retrieval model and transformer is non-trivial. our intensive empirical results demonstrate the soundness and tightness of our conformal generation risk guarantees across four widely-used nlp datasets on four state-of-the-art retrieval models.
Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Jinjie Gu, Huajun Chen
Abstract: despite significant strides in multimodal tasks, multimodal large language models (mllms) are plagued by the critical issue of hallucination. the reliable detection of such hallucinations in mllms has, therefore, become a vital aspect of model evaluation and the safeguarding of practical application deployment. prior research in this domain has been constrained by a narrow focus on singular tasks, an inadequate range of hallucination categories addressed, and a lack of detailed granularity. in response to these challenges, our work expands the investigative horizons of hallucination detection. we present a novel meta-evaluation benchmark, mhalubench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods. additionally, we unveil a novel unified multimodal hallucination detection framework, unihd, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly. we demonstrate the effectiveness of unihd through meticulous evaluation and comprehensive analysis. we also provide strategic insights on the application of specific tools for addressing various categories of hallucinations.
Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, Haohan Wang
Abstract: the discovery of "jailbreaks" to bypass safety filters of large language models (llms) and harmful responses have encouraged the community to implement safety measures. one major safety measure is to proactively test the llms with jailbreaks prior to the release. therefore, such testing will require a method that can generate jailbreaks massively and efficiently. in this paper, we follow a novel yet intuitive strategy to generate jailbreaks in the style of the human generation. we propose a role-playing system that assigns four different roles to the user llms to collaborate on new jailbreaks. furthermore, we collect existing jailbreaks and split them into different independent characteristics using clustering frequency and semantic patterns sentence by sentence. we organize these characteristics into a knowledge graph, making them more accessible and easier to retrieve. our system of different roles will leverage this knowledge graph to generate new jailbreaks, which have proved effective in inducing llms to generate unethical or guideline-violating responses. in addition, we also pioneer a setting in our system that will automatically follow the government-issued guidelines to generate jailbreaks to test whether llms follow the guidelines accordingly. we refer to our system as guard (guideline upholding through adaptive role-play diagnostics). we have empirically validated the effectiveness of guard on three cutting-edge open-sourced llms (vicuna-13b, longchat-7b, and llama-2-7b), as well as a widely-utilized commercial llm (chatgpt). moreover, our work extends to the realm of vision language models (minigpt-v2 and gemini vision pro), showcasing guard's versatility and contributing valuable insights for the development of safer, more reliable llm-based applications across diverse modalities.
Edward Kim
Abstract: given the impressive capabilities of recent large language models (llms), we investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations, e.g. overrides. these include the ability of the model to override the knowledge within the weights of the model, the ability to override (or moderate) extracted knowledge in the prompt, and lastly the ability to perform a full jailbreak. experimentation performed suggest several key findings to improve instruction following - larger models perform the best in following instructions that override internal and contextual instructions, and are obedient, even to a fault. when scaling to longer contexts via rope scaling, a significant buffer needs to be maintained from the edge of the perplexity cliff in order to maintain instruction following capabilities. finally, we observe improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines. thus, we postulate the most effective approach for safe, trustworthy ai should be dealt external to the llm itself.
Mohammad Yaghini, Patty Liu, Franziska Boenisch, Nicolas Papernot
Abstract: existing work on trustworthy machine learning (ml) often concentrates on individual aspects of trust, such as fairness or privacy. additionally, many techniques overlook the distinction between those who train ml models and those responsible for assessing their trustworthiness. to address these issues, we propose a framework that views trustworthy ml as a multi-objective multi-agent optimization problem. this naturally lends itself to a game-theoretic formulation we call regulation games. we illustrate a particular game instance, the specgame in which we model the relationship between an ml model builder and fairness and privacy regulators. regulators wish to design penalties that enforce compliance with their specification, but do not want to discourage builders from participation. seeking such socially optimal (i.e., efficient for all agents) solutions to the game, we introduce paretoplay. this novel equilibrium search algorithm ensures that agents remain on the pareto frontier of their objectives and avoids the inefficiencies of other equilibria. simulating specgame through paretoplay can provide policy guidance for ml regulation. for instance, we show that for a gender classification application, regulators can enforce a differential privacy budget that is on average 4.0 lower if they take the initiative to specify their desired guarantee first.
Sugandha Sharma, Guy Davidson, Khimya Khetarpal, Anssi Kanervisto, Udit Arora, Katja Hofmann, Ida Momennejad
Abstract: achieving human-ai alignment in complex multi-agent games is crucial for creating trustworthy ai agents that enhance gameplay. we propose a method to evaluate this alignment using an interpretable task-sets framework, focusing on high-level behavioral tasks instead of low-level policies. our approach has three components. first, we analyze extensive human gameplay data from xbox's bleeding edge (100k+ games), uncovering behavioral patterns in a complex task space. this task space serves as a basis set for a behavior manifold capturing interpretable axes: fight-flight, explore-exploit, and solo-multi-agent. second, we train an ai agent to play bleeding edge using a generative pretrained causal transformer and measure its behavior. third, we project human and ai gameplay to the proposed behavior manifold to compare and contrast. this allows us to interpret differences in policy as higher-level behavioral concepts, e.g., we find that while human players exhibit variability in fight-flight and explore-exploit behavior, ai players tend towards uniformity. furthermore, ai agents predominantly engage in solo play, while humans often engage in cooperative and competitive multi-agent patterns. these stark differences underscore the need for interpretable evaluation, design, and integration of ai in human-aligned applications. our study advances the alignment discussion in ai and especially generative ai research, offering a measurable framework for interpretable human-agent alignment in multiplayer gaming.

2024-02-04

Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang
Abstract: efforts to align large language models (llms) are mainly conducted via reinforcement learning from human feedback (rlhf) methods. however, rlhf encounters major challenges including training reward models, actor-critic engineering, and importantly, it requires access to llm parameters. here we introduce aligner, a new efficient alignment paradigm that bypasses the whole rlhf process by learning the correctional residuals between the aligned and the unaligned answers. our aligner offers several key advantages. firstly, it is an autoregressive seq2seq model that is trained on the query-answer-correction dataset via supervised learning; this offers a parameter-efficient alignment solution with minimal resources. secondly, the aligner facilitates weak-to-strong generalization; finetuning large pretrained models by aligner's supervisory signals demonstrates strong performance boost. thirdly, aligner functions as a model-agnostic plug-and-play module, allowing for its direct application on different open-source and api-based models. remarkably, aligner-7b improves 11 different llms by 21.9% in helpfulness and 23.8% in harmlessness on average (gpt-4 by 17.5% and 26.9%). when finetuning (strong) llama2-70b with (weak) aligner-13b's supervision, we can improve llama2 by 8.2% in helpfulness and 61.6% in harmlessness. see our dataset and code at https://aligner2024.github.io
Philip Quirke, Clement Neo, Fazl Barez
Abstract: language models (lms) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. here, we define a stringent standard of trustworthiness whereby the task algorithm and circuit implementation must be verified, accounting for edge cases, with no known failure modes. we show that a transformer model can be trained to meet this standard if built using mathematically and logically specified frameworks. in this paper, we fully verify a model for n-digit integer addition. to exhibit the reusability of verified modules, we insert the trained integer addition model into an untrained model and train the combined model to perform both addition and subtraction. we find extensive reuse of the addition circuits for both tasks, easing verification of the more complex subtractor model. we discuss how inserting verified task modules into lms can leverage model reuse to improve verifiability and trustworthiness of language models built using them. the reuse of verified circuits reduces the effort to verify more complex composite models which we believe to be a significant step towards safety of language models.
Jinwoo Ahn
Abstract: large language models (llms) frequently suffer from knowledge-intensive questions, often being inconsistent by providing different outputs despite given the same input. the response quality worsens when the user expresses a firm opposing stance which causes the llms to adjust its response despite the correct initial one. these behaviors decrease the reliability and validity of the responses provided by these models. in this paper, we attempt to 1) raise awareness of the inherent risks that follow from overly relying on ai agents like chatgpt by showing how chain-of-feedback (cof) triggers llms to deviate more from the actual answer and 2) suggest a novel prompting method, recursive chain of feedback (r-cof), that we are conducting further study. the cof system takes in an open-ended multi-step question. then, we repetitively provide meaningless feedback requesting another attempt. our preliminary experiments show that such feedback only decreases the quality of the response. on the other hand, to mitigate the effects of the aforementioned inconsistencies, we present a novel method of recursively revising the initial incorrect reasoning provided by the llm by repetitively breaking down each incorrect step into smaller individual problems.
Rohin Manvi, Samar Khanna, Marshall Burke, David Lobell, Stefano Ermon
Abstract: large language models (llms) inherently carry the biases contained in their training corpora, which can lead to the perpetuation of societal harm. as the impact of these foundation models grows, understanding and evaluating their biases becomes crucial to achieving fairness and accuracy. we propose to study what llms know about the world we live in through the lens of geography. this approach is particularly powerful as there is ground truth for the numerous aspects of human life that are meaningfully projected onto geographic space such as culture, race, language, politics, and religion. we show various problematic geographic biases, which we define as systemic errors in geospatial predictions. initially, we demonstrate that llms are capable of making accurate zero-shot geospatial predictions in the form of ratings that show strong monotonic correlation with ground truth (spearman's $\rho$ of up to 0.89). we then show that llms exhibit common biases across a range of objective and subjective topics. in particular, llms are clearly biased against locations with lower socioeconomic conditions (e.g. most of africa) on a variety of sensitive subjective topics such as attractiveness, morality, and intelligence (spearman's $\rho$ of up to 0.70). finally, we introduce a bias score to quantify this and find that there is significant variation in the magnitude of bias across existing llms.

2024-02-03

Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Qingfu Zhang, Siyuan Qi, Yaodong Yang
Abstract: current methods for large language model alignment typically use scalar human preference labels. however, this convention tends to oversimplify the multi-dimensional and heterogeneous nature of human preferences, leading to reduced expressivity and even misalignment. this paper presents panacea, an innovative approach that reframes alignment as a multi-dimensional preference optimization problem. panacea trains a single model capable of adapting online and pareto-optimally to diverse sets of preferences without the need for further tuning. a major challenge here is using a low-dimensional preference vector to guide the model's behavior, despite it being governed by an overwhelmingly large number of parameters. to address this, panacea is designed to use singular value decomposition (svd)-based low-rank adaptation, which allows the preference vector to be simply injected online as singular values. theoretically, we prove that panacea recovers the entire pareto front with common loss aggregation methods under mild conditions. moreover, our experiments demonstrate, for the first time, the feasibility of aligning a single llm to represent a spectrum of human preferences through various optimization methods. our work marks a step forward in effectively and efficiently aligning models to diverse and intricate human preferences in a controllable and pareto-optimal manner.
Claudio Spiess, David Gros, Kunal Suresh Pai, Michael Pradel, Md Rafiqul Islam Rabin, Susmit Jha, Prem Devanbu, Toufique Ahmed
Abstract: machine learning models are widely used but can also often be wrong. users would benefit from a reliable indication of whether a given output from a given model should be trusted, so a rational decision can be made whether to use the output or not. for example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, then the model is said to be well-calibrated. in this case, for example, high-confidence outputs could be safely accepted, and low-confidence outputs rejected. calibration has so far been studied in non-generative (e.g., classification) settings, especially in software engineering. however, generated code can quite often be wrong: developers need to know when they should e.g., directly use, use after careful review, or discard model-generated code; thus calibration is vital in generative settings. however, the notion of correctness of generated code is non-trivial, and thus so is calibration. in this paper we make several contributions. we develop a framework for evaluating the calibration of code-generating models. we consider several tasks, correctness criteria, datasets, and approaches, and find that by and large generative code models are not well-calibrated out of the box. we then show how calibration can be improved, using standard methods such as platt scaling. our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offers a framework for future research to further improve calibration methods for generative models in software engineering.
Ruotian Ma, Xiaolei Wang, Xin Zhou, Jian Li, Nan Du, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: llm-based automatic prompt optimization, which typically utilizes llms as prompt optimizers to self-reflect and refine prompts, has shown promising performance in recent studies. despite the success, the underlying mechanism of this approach remains unexplored, and the true effectiveness of llms as prompt optimizers requires further validation. in this work, we conducted a comprehensive study to uncover the actual mechanism of llm-based prompt optimization. our findings reveal that the llm optimizers struggle to identify the true causes of errors during reflection, tending to be biased by their own prior knowledge rather than genuinely reflecting on the errors. furthermore, even when the reflection is semantically valid, the llm optimizers often fail to generate appropriate prompts for the target models with a single prompt refinement step, partly due to the unpredictable behaviors of the target models. based on the observations, we introduce a new "automatic behavior optimization" paradigm, which directly optimizes the target model's behavior in a more controllable manner. we hope our study can inspire new directions for automatic prompt optimization development.
Sarah Masud, Mohammad Aflah Khan, Vikram Goyal, Md Shad Akhtar, Tanmoy Chakraborty
Abstract: despite the widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (plms) affect their performance in hate speech detection. through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of plms' use in hate speech detection. we deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. we further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.
Pengfei He, Han Xu, Yue Xing, Hui Liu, Makoto Yamada, Jiliang Tang
Abstract: in the domain of large language models (llms), in-context learning (icl) has been recognized for its innovative ability to adapt to new tasks, relying on examples rather than retraining or fine-tuning. this paper delves into the critical issue of icl's susceptibility to data poisoning attacks, an area not yet fully explored. we wonder whether icl is vulnerable, with adversaries capable of manipulating example data to degrade model performance. to address this, we introduce iclpoison, a specialized attacking framework conceived to exploit the learning mechanisms of icl. our approach uniquely employs discrete text perturbations to strategically influence the hidden states of llms during the icl process. we outline three representative strategies to implement attacks under our framework, each rigorously evaluated across a variety of models and tasks. our comprehensive tests, including trials on the sophisticated gpt-4 model, demonstrate that icl's performance is significantly compromised under our framework. these revelations indicate an urgent need for enhanced defense mechanisms to safeguard the integrity and reliability of llms in applications relying on in-context learning.
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales
Abstract: current vision large language models (vllms) exhibit remarkable capabilities yet are prone to generate harmful content and are vulnerable to even the simplest jailbreaking attacks. our initial analysis finds that this is due to the presence of harmful data during vision-language instruction fine-tuning, and that vllm fine-tuning can cause forgetting of safety alignment previously learned by the underpinning llm. to address this issue, we first curate a vision-language safe instruction-following dataset vlguard covering various harmful categories. our experiments demonstrate that integrating this dataset into standard vision-language fine-tuning or utilizing it for post-hoc fine-tuning effectively safety aligns vllms. this alignment is achieved with minimal impact on, or even enhancement of, the models' helpfulness. the versatility of our safety fine-tuning dataset makes it a valuable resource for safety-testing existing vllms, training new models or safeguarding pre-trained vllms. empirical results demonstrate that fine-tuned vllms effectively reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks, which approach zero in many cases. the code and dataset are available at https://github.com/ys-zong/vlguard.
Zhenxing Niu, Haodong Ren, Xinbo Gao, Gang Hua, Rong Jin
Abstract: this paper focuses on jailbreaking attacks against multi-modal large language models (mllms), seeking to elicit mllms to generate objectionable responses to harmful user queries. a maximum likelihood-based algorithm is proposed to find an \emph{image jailbreaking prompt} (imgjp), enabling jailbreaks against mllms across multiple unseen prompts and images (i.e., data-universal property). our approach exhibits strong model-transferability, as the generated imgjp can be transferred to jailbreak various models, including minigpt-v2, llava, instructblip, and mplug-owl2, in a black-box manner. moreover, we reveal a connection between mllm-jailbreaks and llm-jailbreaks. as a result, we introduce a construction-based method to harness our approach for llm-jailbreaks, demonstrating greater efficiency than current state-of-the-art methods. the code is available here. \textbf{warning: some content generated by language models may be offensive to some readers.}

2024-02-02

Roberto Natella, Pietro Liguori, Cristina Improta, Bojan Cukic, Domenico Cotroneo
Abstract: recent advances of artificial intelligence (ai) code generators are opening new opportunities in software security research, including misuse by malicious actors. we review use cases for ai code generators for security and introduce an evaluation benchmark.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela
Abstract: kahneman & tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. we show that objectives for aligning llms with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., dpo) over cross-entropy minimization can partly be ascribed to them being $\textit{human-aware loss functions}$ (halos). however, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. using a kahneman-tversky model of human utility, we propose a halo that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. we call this approach kahneman-tversky optimization (kto), and it matches or exceeds the performance of preference-based methods at scales from 1b to 30b. crucially, kto does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. this makes it far easier to use in the real world, where preference data is scarce and expensive.
Tongtong Wu, Linhao Luo, Yuan-Fang Li, Shirui Pan, Thuy-Trang Vu, Gholamreza Haffari
Abstract: large language models (llms) are not amenable to frequent re-training, due to high training costs arising from their massive scale. however, updates are necessary to endow llms with new skills and keep them up-to-date with rapidly evolving human knowledge. this paper surveys recent works on continual learning for llms. due to the unique nature of llms, we catalog continue learning techniques in a novel multi-staged categorization scheme, involving continual pretraining, instruction tuning, and alignment. we contrast continual learning for llms with simpler adaptation methods used in smaller models, as well as with other enhancement strategies like retrieval-augmented generation and model editing. moreover, informed by a discussion of benchmarks and evaluation, we identify several challenges and future work directions for this crucial task.
Willem Van Der Maden, Derek Lomas, Paul Hekkert
Abstract: as artificial intelligence (ai) continues advancing, ensuring positive societal impacts becomes critical, especially as ai systems become increasingly ubiquitous in various aspects of life. however, developing "ai for good" poses substantial challenges around aligning systems with complex human values. presently, we lack mature methods for addressing these challenges. this article presents and evaluates the positive ai design method aimed at addressing this gap. the method provides a human-centered process to translate wellbeing aspirations into concrete practices. first, we explain the method's four key steps: contextualizing, operationalizing, optimizing, and implementing wellbeing supported by continuous measurement for feedback cycles. we then present a multiple case study where novice designers applied the method, revealing strengths and weaknesses related to efficacy and usability. next, an expert evaluation study assessed the quality of the resulting concepts, rating them moderately high for feasibility, desirability, and plausibility of achieving intended wellbeing benefits. together, these studies provide preliminary validation of the method's ability to improve ai design, while surfacing areas needing refinement like developing support for complex steps. proposed adaptations such as examples and evaluation heuristics could address weaknesses. further research should examine sustained application over multiple projects. this human-centered approach shows promise for realizing the vision of 'ai for wellbeing' that does not just avoid harm, but actively benefits humanity.
Wenyue Hua, Xianjun Yang, Zelong Li, Cheng Wei, Yongfeng Zhang
Abstract: the emergence of llm-based agents has garnered considerable attention, yet their trustworthiness remains an under-explored area. as agents can directly interact with the physical environment, their reliability and safety is critical. this paper presents an agent-constitution-based agent framework, trustagent, an initial investigation into improving the safety dimension of trustworthiness in llm-based agents. this framework consists of threefold strategies: pre-planning strategy which injects safety knowledge to the model prior to plan generation, in-planning strategy which bolsters safety during plan generation, and post-planning strategy which ensures safety by post-planning inspection. through experimental analysis, we demonstrate how these approaches can effectively elevate an llm agent's safety by identifying and preventing potential dangers. furthermore, we explore the intricate relationships between safety and helpfulness, and between the model's reasoning ability and its efficacy as a safe agent. this paper underscores the imperative of integrating safety awareness and trustworthiness into the design and deployment of llm-based agents, not only to enhance their performance but also to ensure their responsible integration into human-centric environments. data and code are available at https://github.com/agiresearch/trustagent.
Debarun Bhattacharjya, Junkyu Lee, Don Joven Agravante, Balaji Ganesan, Radu Marinescu
Abstract: foundation models (fms) such as large language models have revolutionized the field of ai by showing remarkable performance in various tasks. however, they exhibit numerous limitations that prevent their broader adoption in many real-world systems, which often require a higher bar for trustworthiness and usability. since fms are trained using loss functions aimed at reconstructing the training corpus in a self-supervised manner, there is no guarantee that the model's output aligns with users' preferences for a specific task at hand. in this survey paper, we propose a conceptual framework that encapsulates different modes by which agents could interact with fms and guide them suitably for a set of tasks, particularly through knowledge augmentation and reasoning. our framework elucidates agent role categories such as updating the underlying fm, assisting with prompting the fm, and evaluating the fm output. we also categorize several state-of-the-art approaches into agent interaction protocols, highlighting the nature and extent of involvement of the various agent roles. the proposed framework provides guidance for future directions to further realize the power of fms in practical ai systems.
Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, Xiaowei Huang
Abstract: as large language models (llms) become more integrated into our daily lives, it is crucial to identify and mitigate their risks, especially when the risks can have profound impacts on human users and societies. guardrails, which filter the inputs or outputs of llms, have emerged as a core safeguarding technology. this position paper takes a deep look at current open-source solutions (llama guard, nvidia nemo, guardrails ai), and discusses the challenges and the road towards building more complete solutions. drawing on robust evidence from previous research, we advocate for a systematic approach to construct guardrails for llms, based on comprehensive consideration of diverse contexts across various llms applications. we propose employing socio-technical methods through collaboration with a multi-disciplinary team to pinpoint precise technical requirements, exploring advanced neural-symbolic implementations to embrace the complexity of the requirements, and developing verification and testing to ensure the utmost quality of the final product.
Inyoung Cheong, King Xia, K. J. Kevin Feng, Quan Ze Chen, Amy X. Zhang
Abstract: the rapid proliferation of large language models (llms) as general purpose chatbots available to the public raises hopes around expanding access to professional guidance in law, medicine, and finance, while triggering concerns about public reliance on llms for high-stakes circumstances. prior research has speculated on high-level ethical considerations but lacks concrete criteria determining when and why llm chatbots should or should not provide professional assistance. through examining the legal domain, we contribute a structured expert analysis to uncover nuanced policy considerations around using llms for professional advice, using methods inspired by case-based reasoning. we convened workshops with 20 legal experts and elicited dimensions on appropriate ai assistance for sample user queries (``cases''). we categorized our expert dimensions into: (1) user attributes, (2) query characteristics, (3) ai capabilities, and (4) impacts. beyond known issues like hallucinations, experts revealed novel legal problems, including that users' conversations with llms are not protected by attorney-client confidentiality or bound to professional ethics that guard against conflicted counsel or poor quality advice. this accountability deficit led participants to advocate for ai systems to help users polish their legal questions and relevant facts, rather than recommend specific actions. more generally, we highlight the potential of case-based expert deliberation as a method of responsibly translating professional integrity and domain knowledge into design requirements to inform appropriate ai behavior when generating advice in professional domains.
Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
Abstract: aligning language models (lms) with curated human feedback is critical to control their behaviors in real-world applications. several recent policy optimization methods, such as dpo and slic, serve as promising alternatives to the traditional reinforcement learning from human feedback (rlhf) approach. in practice, human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt. multiple responses can also be ranked by reward models or ai feedback. there lacks such a study on directly fitting upon a list of responses. in this work, we formulate the lm alignment as a listwise ranking problem and describe the listwise preference optimization (lipo) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. this view draws an explicit connection to learning-to-rank (ltr), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. following this connection, we provide an examination of ranking objectives that are not well studied for lm alignment withdpo and slic as special cases when list size is two. in particular, we highlight a specific method, lipo-{\lambda}, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. we show that lipo-{\lambda} can outperform dpo and slic by a clear margin on two preference alignment tasks.
Angelina Wang, Jamie Morgenstern, John P. Dickerson
Abstract: large language models (llms) are increasing in capability and popularity, propelling their application in new domains -- including as replacements for human participants in computational social science, user testing, annotation tasks, and more. traditionally, in all of these settings survey distributors are careful to find representative samples of the human population to ensure the validity of their results and understand potential demographic differences. this means in order to be a suitable replacement, llms will need to be able to capture the influence of positionality (i.e., relevance of social identities like gender and race). however, we show that there are two inherent limitations in the way current llms are trained that prevent this. we argue analytically for why llms are doomed to both misportray and flatten the representations of demographic groups, then empirically show this to be true on 4 llms through a series of human studies with 3200 participants across 16 demographic identities. we also discuss a third consideration about how identity prompts can essentialize identities. throughout, we connect each of these limitations to a pernicious history that shows why each is harmful for marginalized demographic groups. overall, we urge caution in use cases where llms are intended to replace human participants whose identities are relevant to the task at hand. at the same time, in cases where the goal is to supplement rather than replace (e.g., pilot studies), we provide empirically-better inference-time techniques to reduce, but not remove, these harms.
Hao Chen, Bhiksha Raj, Xing Xie, Jindong Wang
Abstract: large foundation models (lfms) are claiming incredible performances. yet great concerns have been raised about their mythic and uninterpreted potentials not only in machine learning, but also in various other disciplines. in this position paper, we propose to identify a neglected issue deeply rooted in lfms: catastrophic inheritance, describing the weaknesses and limitations inherited from biased large-scale pre-training data to behaviors of lfms on the downstream tasks, including samples that are corrupted, long-tailed, noisy, out-of-distributed, to name a few. such inheritance can potentially cause catastrophes to downstream applications, such as bias, lack of generalization, deteriorated performance, security vulnerability, privacy leakage, and value misalignment. we discuss the challenges behind this issue and propose uim, a framework to understand the catastrophic inheritance of lfms from both pre-training and downstream adaptation, interpret the implications of catastrophic inheritance on downstream tasks, and how to mitigate it. uim aims to unite both the machine learning and social sciences communities for more responsible and promising ai development and deployment.
Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Tong Yu, Hanieh Deilamsalehy, Ruiyi Zhang, Sungchul Kim, Franck Dernoncourt
Abstract: large language models (llms) have shown remarkable advances in language generation and understanding but are also prone to exhibiting harmful social biases. while recognition of these behaviors has generated an abundance of bias mitigation techniques, most require modifications to the training data, model parameters, or decoding strategy, which may be infeasible without access to a trainable model. in this work, we leverage the zero-shot capabilities of llms to reduce stereotyping in a technique we introduce as zero-shot self-debiasing. with two approaches, self-debiasing via explanation and self-debiasing via reprompting, we show that self-debiasing can significantly reduce the degree of stereotyping across nine different social groups while relying only on the llm itself and a simple prompt, with explanations correctly identifying invalid assumptions and reprompting delivering the greatest reductions in bias. we hope this work opens inquiry into other zero-shot techniques for bias mitigation.
Tianshi Li, Sauvik Das, Hao-Ping Lee, Dakuo Wang, Bingsheng Yao, Zhiping Zhang
Abstract: the emergence of large language models (llms), and their increased use in user-facing systems, has led to substantial privacy concerns. to date, research on these privacy concerns has been model-centered: exploring how llms lead to privacy risks like memorization, or can be used to infer personal characteristics about people from their content. we argue that there is a need for more research focusing on the human aspect of these privacy issues: e.g., research on how design paradigms for llms affect users' disclosure behaviors, users' mental models and preferences for privacy controls, and the design of tools, systems, and artifacts that empower end-users to reclaim ownership over their personal data. to build usable, efficient, and privacy-friendly systems powered by these models with imperfect privacy properties, our goal is to initiate discussions to outline an agenda for conducting human-centered research on privacy issues in llm-powered systems. this special interest group (sig) aims to bring together researchers with backgrounds in usable security and privacy, human-ai collaboration, nlp, or any other related domains to share their perspectives and experiences on this problem, to help our community establish a collective understanding of the challenges, research opportunities, research methods, and strategies to collaborate with researchers outside of hci.
Sungdong Kim, Minjoon Seo
Abstract: learning from human preference has been considered key to aligning large language models (llms) with human values. however, contrary to popular belief, our preliminary study reveals that reward models trained on human preference datasets tend to give higher scores to long off-topic responses than short on-topic ones. motivated by this observation, we explore a preference-free approach utilizing `relevance' as a key objective for alignment. on our first attempt, we find that the relevance score obtained by a retriever alone is vulnerable to reward hacking, i.e., overoptimizing to undesired shortcuts, when we utilize the score as a reward for reinforcement learning. to mitigate it, we integrate effective inductive biases into the vanilla relevance to regularize each other, resulting in a mixture of reward functions: regularized relevance reward ($r^3$). $r^3$ significantly improves performance on preference benchmarks by providing a robust reward signal. notably, $r^3$ does not require any human preference datasets (i.e., preference-free), outperforming open-source reward models in improving human preference. our analysis demonstrates that $r^3$ has advantages in elevating human preference while minimizing its side effects. finally, we show the generalizability of $r^3$, consistently improving instruction-tuned models in various backbones and sizes without additional dataset cost. our code is available at https://github.com/naver-ai/rrr.

2024-02-01

Shangbin Feng, Herun Wan, Ningnan Wang, Zhaoxuan Tan, Minnan Luo, Yulia Tsvetkov
Abstract: social media bot detection has always been an arms race between advancements in machine learning bot detectors and adversarial bot strategies to evade detection. in this work, we bring the arms race to the next level by investigating the opportunities and risks of state-of-the-art large language models (llms) in social bot detection. to investigate the opportunities, we design novel llm-based bot detectors by proposing a mixture-of-heterogeneous-experts framework to divide and conquer diverse user information modalities. to illuminate the risks, we explore the possibility of llm-guided manipulation of user textual and structured information to evade detection. extensive experiments with three llms on two datasets demonstrate that instruction tuning on merely 1,000 annotated examples produces specialized llms that outperform state-of-the-art baselines by up to 9.1% on both datasets, while llm-guided manipulation strategies could significantly bring down the performance of existing bot detectors by up to 29.6% and harm the calibration and reliability of bot detection systems.
Dawn Lu, Nina Rimsky
Abstract: we address the challenge of societal bias in large language models (llms), focusing on the llama 2 7b chat model. as llms are increasingly integrated into decision-making processes with substantial societal impact, it becomes imperative to ensure these models do not reinforce existing biases. our approach employs activation steering to probe for and mitigate biases related to gender, race, and religion. this method manipulates model activations to direct responses towards or away from biased outputs, utilizing steering vectors derived from the stereoset dataset and custom gpt4 generated gender bias prompts. our findings reveal inherent gender bias in llama 2 7b chat, persisting even after reinforcement learning from human feedback (rlhf). we also observe a predictable negative correlation between bias and the model's tendency to refuse responses. significantly, our study uncovers that rlhf tends to increase the similarity in the model's representation of different forms of societal biases, which raises questions about the model's nuanced understanding of different forms of bias. this work also provides valuable insights into effective red-teaming strategies for llms using activation steering, particularly emphasizing the importance of integrating a refusal vector.
Xinlin Peng, Ying Zhou, Ben He, Le Sun, Yingfei Sun
Abstract: large language models (llms) have exhibited remarkable capabilities in text generation tasks. however, the utilization of these models carries inherent risks, including but not limited to plagiarism, the dissemination of fake news, and issues in educational exercises. although several detectors have been proposed to address these concerns, their effectiveness against adversarial perturbations, specifically in the context of student essay writing, remains largely unexplored. this paper aims to bridge this gap by constructing aig-asap, an ai-generated student essay dataset, employing a range of text perturbation methods that are expected to generate high-quality essays while evading detection. through empirical experiments, we assess the performance of current aigc detectors on the aig-asap dataset. the results reveal that the existing detectors can be easily circumvented using straightforward automatic adversarial attacks. specifically, we explore word substitution and sentence substitution perturbation methods that effectively evade detection while maintaining the quality of the generated essays. this highlights the urgent need for more accurate and robust methods to detect ai-generated student essays in the education domain.
Souvik Das, Rohini K. Srihari
Abstract: state-of-the-art conversational ai systems raise concerns due to their potential risks of generating unsafe, toxic, unethical, or dangerous content. previous works have developed datasets to teach conversational agents the appropriate social paradigms to respond effectively to specifically designed hazardous content. however, models trained on these adversarial datasets still struggle to recognize subtle unsafe situations that appear naturally in conversations or introduce an inappropriate response in a casual context. to understand the extent of this problem, we study prosociality in both adversarial and casual dialog contexts and audit the response quality of general-purpose language models in terms of propensity to produce unsafe content. we propose a dual-step fine-tuning process to address these issues using a socially aware n-pair contrastive loss. subsequently, we train a base model that integrates prosocial behavior by leveraging datasets like moral integrity corpus (mic) and prosocialdialog. experimental results on several dialog datasets demonstrate the effectiveness of our approach in generating socially appropriate responses.
Jitao Sang, Yuhang Wang, Jing Zhang, Yanxu Zhu, Chao Kong, Junhong Ye, Shuyu Wei, Jinlin Xiao
Abstract: this paper presents a follow-up study to openai's recent superalignment work on weak-to-strong generalization (w2sg). superalignment focuses on ensuring that high-level ai systems remain consistent with human values and intentions when dealing with complex, high-risk tasks. the w2sg framework has opened new possibilities for empirical research in this evolving field. our study simulates two phases of superalignment under the w2sg framework: the development of general superhuman models and the progression towards superintelligence. in the first phase, based on human supervision, the quality of weak supervision is enhanced through a combination of scalable oversight and ensemble learning, reducing the capability gap between weak teachers and strong students. in the second phase, an automatic alignment evaluator is employed as the weak supervisor. by recursively updating this auto aligner, the capabilities of the weak teacher models are synchronously enhanced, achieving weak-to-strong supervision over stronger student models.we also provide an initial validation of the proposed approach for the first phase. using the sciq task as example, we explore ensemble learning for weak teacher models through bagging and boosting. scalable oversight is explored through two auxiliary settings: human-ai interaction and ai-ai debate. additionally, the paper discusses the impact of improved weak supervision on enhancing weak-to-strong generalization based on in-context learning. experiment code and dataset will be released at https://github.com/adam-bjtu/w2sg.
Ran Elgedawy, John Sadik, Senjuti Dutta, Anuj Gautam, Konstantinos Georgiou, Farzin Gholamrezae, Fujiao Ji, Kyungchan Lim, Qian Liu, Scott Ruoti
Abstract: $ $large language models (llms) are being increasingly utilized in various applications, with code generations being a notable example. while previous research has shown that llms have the capability to generate both secure and insecure code, the literature does not take into account what factors help generate secure and effective code. therefore in this paper we focus on identifying and understanding the conditions and contexts in which llms can be effectively and safely deployed in real-world scenarios to generate quality code. we conducted a comparative analysis of four advanced llms--gpt-3.5 and gpt-4 using chatgpt and bard and gemini from google--using 9 separate tasks to assess each model's code generation capabilities. we contextualized our study to represent the typical use cases of a real-life developer employing llms for everyday tasks as work. additionally, we place an emphasis on security awareness which is represented through the use of two distinct versions of our developer persona. in total, we collected 61 code outputs and analyzed them across several aspects: functionality, security, performance, complexity, and reliability. these insights are crucial for understanding the models' capabilities and limitations, guiding future development and practical applications in the field of automated code generation.
Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, "Alex D'Amour", Sanmi Koyejo, Victor Veitch
Abstract: a common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. we study two closely related problems that arise in this approach. first, any monotone transformation of the reward model preserves preference ranking; is there a choice that is ``better'' than others? second, we often wish to align language models to multiple properties: how should we combine multiple reward models? using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from bradley-terry preference models. this derived transformation has two important properties. first, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. this mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is ``good'' in all measured properties, in a sense we make precise. experiments aligning language models to be both helpful and harmless using rlhf show substantial improvements over the baseline (non-transformed) approach.
Xin Quan, Marco Valentino, Louise A. Dennis, André Freitas
Abstract: an increasing amount of research in natural language inference (nli) focuses on the application and evaluation of large language models (llms) and their reasoning capabilities. despite their success, however, llms are still prone to factual errors and inconsistencies in their explanations, offering limited control and interpretability for inference in complex domains. in this paper, we focus on ethical nli, investigating how hybrid neuro-symbolic techniques can enhance the logical validity and alignment of ethical explanations produced by llms. specifically, we present an abductive-deductive framework named logic-explainer, which integrates llms with an external backward-chaining solver to refine step-wise natural language explanations and jointly verify their correctness, reduce incompleteness and minimise redundancy. an extensive empirical analysis demonstrates that logic-explainer can improve explanations generated via in-context learning methods and chain-of-thought (cot) on challenging ethical nli tasks, while, at the same time, producing formal proofs describing and supporting models' reasoning. as ethical nli requires commonsense reasoning to identify underlying moral violations, our results suggest the effectiveness of neuro-symbolic methods for multi-step nli more broadly, opening new opportunities to enhance the logical consistency, reliability, and alignment of llms.
Alex J. Chan, Hao Sun, Samuel Holt, Mihaela Van Der Schaar
Abstract: reinforcement learning from human feedback (rlhf) has been credited as the key advance that has allowed large language models (llms) to effectively follow instructions and produce useful assistance. classically, this involves generating completions from the llm in response to a query before using a separate reward model to assign a score to the full completion. as an auto-regressive process, the llm has to take many "actions" (selecting individual tokens) and only receives a single, sparse reward at the end of an episode, a setup that is known to be difficult to optimise in traditional reinforcement learning. in this work we leverage the fact that the reward model contains more information than just its scalar output, in particular, it calculates an attention map over tokens as part of the transformer architecture. we use these attention weights to redistribute the reward along the whole completion, effectively densifying the signal and highlighting the most important tokens, all without incurring extra computational cost or requiring any additional modelling. we demonstrate that, theoretically, this approach is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
Zelong Li, Wenyue Hua, Hao Wang, He Zhu, Yongfeng Zhang
Abstract: recent advancements on large language models (llms) enable ai agents to automatically generate and execute multi-step plans to solve complex tasks. however, since llm's content generation process is hardly controllable, current llm-based agents frequently generate invalid or non-executable plans, which jeopardizes the performance of the generated plans and corrupts users' trust in llm-based agents. in response, this paper proposes a novel ``formal-llm'' framework for llm-based agents by integrating the expressiveness of natural language and the precision of formal language. specifically, the framework allows human users to express their requirements or constraints for the planning process as an automaton. a stack-based llm plan generation process is then conducted under the supervision of the automaton to ensure that the generated plan satisfies the constraints, making the planning process controllable. we conduct experiments on both benchmark tasks and practical real-life tasks, and our framework achieves over 50% overall performance increase, which validates the feasibility and effectiveness of employing formal-llm to guide the plan generation of agents, preventing the agents from generating invalid and unsuccessful plans. further, more controllable llm-based agents can facilitate the broader utilization of llm in application scenarios where high validity of planning is essential. the work is open-sourced at https://github.com/agiresearch/formal-llm.
Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, Minlie Huang
Abstract: the alignment of language models with human preferences is vital for their application in real-world tasks. the problem is formulated as optimizing the model's policy to maximize the expected reward that reflects human preferences with minimal deviation from the initial policy. while considered as a straightforward solution, reinforcement learning (rl) suffers from high variance in policy updates, which impedes efficient policy improvement. recently, direct preference optimization (dpo) was proposed to directly optimize the policy from preference data. though simple to implement, dpo is derived based on the optimal policy that is not assured to be achieved in practice, which undermines its convergence to the intended solution. in this paper, we propose efficient exact optimization (exo) of the alignment objective. we prove that exo is guaranteed to optimize in the same direction as the rl algorithms asymptotically for arbitary parametrization of the policy, while enables efficient optimization by circumventing the complexities associated with rl algorithms. we compare our method to dpo with both theoretical and empirical analyses, and further demonstrate the advantages of our method over existing approaches on realistic human preference data.
Ahmed Radwan, Layan Zaafarani, Jetana Abudawood, Faisal Alzahrani, Fares Fourat
Abstract: addressing biases in ai models is crucial for ensuring fair and accurate predictions. however, obtaining large, unbiased datasets for training can be challenging. this paper proposes a comprehensive approach using multiple methods to remove bias in ai models, with only a small dataset and a potentially biased pretrained model. we train multiple models with the counter-bias of the pre-trained model through data splitting, local training, and regularized fine-tuning, gaining potentially counter-biased models. then, we employ ensemble learning for all models to reach unbiased predictions. to further accelerate the inference time of our ensemble model, we conclude our solution with knowledge distillation that results in a single unbiased neural network. we demonstrate the effectiveness of our approach through experiments on the cifar10 and ham10000 datasets, showcasing promising results. this work contributes to the ongoing effort to create more unbiased and reliable ai models, even with limited data availability.
Wenqi Wei, Ling Liu
Abstract: emerging distributed ai systems are revolutionizing big data computing and data processing capabilities with growing economic and societal impact. however, recent studies have identified new attack surfaces and risks caused by security, privacy, and fairness issues in ai systems. in this paper, we review representative techniques, algorithms, and theoretical foundations for trustworthy distributed ai through robustness guarantee, privacy protection, and fairness awareness in distributed learning. we first provide a brief overview of alternative architectures for distributed learning, discuss inherent vulnerabilities for security, privacy, and fairness of ai algorithms in distributed learning, and analyze why these problems are present in distributed learning regardless of specific architectures. then we provide a unique taxonomy of countermeasures for trustworthy distributed ai, covering (1) robustness to evasion attacks and irregular queries at inference, and robustness to poisoning attacks, byzantine attacks, and irregular data distribution during training; (2) privacy protection during distributed learning and model inference at deployment; and (3) ai fairness and governance with respect to both data and models. we conclude with a discussion on open challenges and future research directions toward trustworthy distributed ai, such as the need for trustworthy ai policy guidelines, the ai responsibility-utility co-design, and incentives and compliance.
Tiansheng Huang, Sihao Hu, Ling Liu
Abstract: the new paradigm of finetuning-as-a-service introduces a new attack surface for large language models (llms): a few harmful data uploaded by users can easily trick the finetuning to produce an alignment-broken model. we conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, showing a probable cause of the alignment-broken effect. inspired by our findings, we propose vaccine, a perturbation-aware alignment technique to mitigate the security risk of users finetuning. the core idea of vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. this enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. our results on open source mainstream llms (e.g., llama2, opt, vicuna) demonstrate that vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. our code is available at \url{https://github.com/git-disl/vaccine}.

2024-01-31

Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin
Abstract: large language models are meticulously aligned to be both helpful and harmless. however, recent research points to a potential overkill which means models may refuse to answer benign queries. in this paper, we investigate the factors for overkill by exploring how models handle and determine the safety of queries. our findings reveal the presence of shortcuts within models, leading to an over-attention of harmful words like 'kill' and prompts emphasizing safety will exacerbate overkill. based on these insights, we introduce self-contrastive decoding (self-cd), a training-free and model-agnostic strategy, to alleviate this phenomenon. we first extract such over-attention by amplifying the difference in the model's output distributions when responding to system prompts that either include or omit an emphasis on safety. then we determine the final next-token predictions by downplaying the over-attention from the model via contrastive decoding. empirical results indicate that our method has achieved an average reduction of the refusal rate by 20\% while having almost no impact on safety.
Raymond Douglas, Andis Draguns, Tomáš Gavenčiak
Abstract: language models (lms) have become important tools in a variety of applications, from data processing to the creation of instruction-following assistants. but despite their advantages, lms have certain idiosyncratic limitations such as the problem of `strong priors', where a model learns to output typical continuations in response to certain, usually local, portions of the input regardless of any earlier instructions. for example, prompt injection attacks can induce models to ignore explicit directives. in some cases, larger models have been shown to be more susceptible to these problems than similar smaller models, an example of the phenomenon of `inverse scaling'. we develop a new technique for mitigating the problem of strong priors: we take the original set of instructions, produce a weakened version of the original prompt that is even more susceptible to the strong priors problem, and then extrapolate the continuation away from the weakened prompt. this lets us infer how the model would continue a hypothetical strengthened set of instructions. our technique conceptualises lms as mixture models which combine a family of data generation processes, reinforcing the desired elements of the mixture. our approach works at inference time, removing any need for retraining. we apply it to eleven models including gpt-2, gpt-3, llama 2, and mistral on four tasks, and find improvements in 41/44. across all 44 combinations the median increase in proportion of tasks completed is 40%.
Pardis Sadat Zahraei, Ali Emami
Abstract: the winograd schema challenge (wsc) serves as a prominent benchmark for evaluating machine understanding. while large language models (llms) excel at answering wsc questions, their ability to generate such questions remains less explored. in this work, we propose tree-of-experts (toe), a novel prompting method which enhances the generation of wsc instances (50% valid cases vs. 10% in recent methods). using this approach, we introduce wsc+, a novel dataset comprising 3,026 llm-generated sentences. notably, we extend the wsc framework by incorporating new 'ambiguous' and 'offensive' categories, providing a deeper insight into model overconfidence and bias. our analysis reveals nuances in generation-evaluation consistency, suggesting that llms may not always outperform in evaluating their own generated questions when compared to those crafted by other models. on wsc+, gpt-4, the top-performing llm, achieves an accuracy of 68.7%, significantly below the human benchmark of 95.1%.
Marcin Korecki
Abstract: the dominant paradigm in ai ethics and value alignment is highly anthropocentric. the focus of these disciplines is strictly on human values which limits the depth and breadth of their insights. recently, attempts to expand to a sentientist perspective have been initiated. we argue that neither of these outlooks is sufficient to capture the actual complexity of the biosphere and ensure that ai does not damage it. thus, we propose a new paradigm -- biospheric ai that assumes an ecocentric perspective. we discuss hypothetical ways in which such an ai might be designed. moreover, we give directions for research and application of the modern ai models that would be consistent with the biospheric interests. all in all, this work attempts to take first steps towards a comprehensive program of research that focuses on the interactions between ai and the biosphere.
Shujaat Mirza, Bruno Coelho, Yuyuan Cui, Christina Pöpper, Damon Mccoy
Abstract: the increasing reliance on ai-driven solutions, particularly large language models (llms) like the gpt series, for information retrieval highlights the critical need for their factuality and fairness, especially amidst the rampant spread of misinformation and disinformation online. our study evaluates the factual accuracy, stability, and biases in widely adopted gpt models, including gpt-3.5 and gpt-4, contributing to reliability and integrity of ai-mediated information dissemination. we introduce 'global-liar,' a dataset uniquely balanced in terms of geographic and temporal representation, facilitating a more nuanced evaluation of llm biases. our analysis reveals that newer iterations of gpt models do not always equate to improved performance. notably, the gpt-4 version from march demonstrates higher factual accuracy than its subsequent june release. furthermore, a concerning bias is observed, privileging statements from the global north over the global south, thus potentially exacerbating existing informational inequities. regions such as africa and the middle east are at a disadvantage, with much lower factual accuracy. the performance fluctuations over time suggest that model updates may not consistently benefit all regions equally. our study also offers insights into the impact of various llm configuration settings, such as binary decision forcing, model re-runs and temperature, on model's factuality. models constrained to binary (true/false) choices exhibit reduced factuality compared to those allowing an 'unclear' option. single inference at a low temperature setting matches the reliability of majority voting across various configurations. the insights gained highlight the need for culturally diverse and geographically inclusive model training and evaluation. this approach is key to achieving global equity in technology, distributing ai benefits fairly worldwide.
Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, Lichao Sun
Abstract: do large language models (llms) exhibit any forms of awareness similar to humans? in this paper, we introduce the concept of awareness to llms, arguing that awareness is an essential aspect of trustworthiness for llms to enhance their interaction with humans while ensuring ethical responses. we define awareness in llms as the ability to perceive and understand themselves as ai models and to exhibit social intelligence. we identify four key dimensions of awareness: capability, mission, emotion, and perspective. to assess llms on these dimensions, we introduce a specialized dataset, awarellm dataset. our findings reveal that llms demonstrate a decent degree of awareness, though they still lack substantial capability awareness.
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
Abstract: prepending model inputs with safety prompts is a common practice of safeguarding large language models (llms) from complying with queries that contain harmful intents. however, the working mechanisms of safety prompts have not yet been fully understood, which hinders the potential for automatically optimizing them for improved llm safety. motivated by this problem, we investigate the impact of safety prompts from the perspective of model representations. we find that in models' representation space, harmful and harmless queries can be largely distinguished, but this is not noticeably enhanced by safety prompts. instead, the queries' representations are moved by different safety prompts in similar directions, where models become more prone to refusal (i.e., refusing to provide assistance) even when the queries are harmless. inspired by these findings, we propose a method called dro (directed representation optimization) for automatic safety prompt optimization. dro treats safety prompts as continuous, trainable embeddings and learns to move the representations of harmful/harmless queries along/opposite the direction in which the model's refusal probability increases. we demonstrate that dro remarkably improves the safeguarding performance of human-crafted safety prompts and outperforms strong baselines, as evaluated on out-of-domain benchmarks, without compromising the general model capability.
Mowafak Allaham, Nicholas Diakopoulos
Abstract: anticipating the negative impacts of emerging ai technologies is a challenge, especially in the early stages of development. an understudied approach to such anticipation is the use of llms to enhance and guide this process. despite advancements in llms and evaluation metrics to account for biases in generated text, it is unclear how well these models perform in anticipatory tasks. specifically, the use of llms to anticipate ai impacts raises questions about the quality and range of categories of negative impacts these models are capable of generating. in this paper we leverage news media, a diverse data source that is rich with normative assessments of emerging technologies, to formulate a taxonomy of impacts to act as a baseline for comparing against. by computationally analyzing thousands of news articles published by hundreds of online news domains around the world, we develop a taxonomy consisting of ten categories of ai impacts. we then evaluate both instruction-based (gpt-4 and mistral-7b-instruct) and fine-tuned completion models (mistral-7b and gpt-3) using a sample from this baseline. we find that the generated impacts using mistral-7b, fine-tuned on impacts from the news media, tend to be qualitatively on par with impacts generated using a larger scale model such as gpt-4. moreover, we find that these llms generate impacts that largely reflect the taxonomy of negative impacts identified in the news media, however the impacts produced by instruction-based models had gaps in the production of certain categories of impacts in comparison to fine-tuned models. this research highlights a potential bias in state-of-the-art llms when used for anticipating impacts and demonstrates the advantages of aligning smaller llms with a diverse range of impacts, such as those reflected in the news media, to better reflect such impacts during anticipatory exercises.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, Juanzi Li
Abstract: extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. to address this, we present longalign -- a recipe of the instruction data, training, and evaluation for long context alignment. first, we construct a long instruction-following dataset using self-instruct. to ensure the data diversity, it covers a broad range of tasks from various long context sources. second, we adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. third, we introduce the longbench-chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. experiments show that longalign outperforms existing recipes for llms in long context tasks by up to 30\%, while also maintaining their proficiency in handling short, generic tasks. the code, data, and long-aligned models are open-sourced at https://github.com/thudm/longalign.
Yao-Hung Hubert Tsai, Walter Talbott, Jian Zhang
Abstract: step-by-step decision planning with large language models (llms) is gaining attention in ai agent development. this paper focuses on decision planning with uncertainty estimation to address the hallucination problem in language models. existing approaches are either white-box or computationally demanding, limiting use of black-box proprietary llms within budgets. the paper's first contribution is a non-parametric uncertainty quantification method for llms, efficiently estimating point-wise dependencies between input-decision on the fly with a single inference, without access to token logits. this estimator informs the statistical interpretation of decision trustworthiness. the second contribution outlines a systematic design for a decision-making agent, generating actions like ``turn on the bathroom light'' based on user prompts such as ``take a bath''. users will be asked to provide preferences when more than one action has high estimated point-wise dependencies. in conclusion, our uncertainty estimation and decision-making agent design offer a cost-efficient approach for ai agent development.
Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, Wei Peng
Abstract: recent development of large vision-language models (lvlms) has attracted growing attention within the ai landscape for its practical implementation potential. however, ``hallucination'', or more specifically, the misalignment between factual visual content and corresponding textual generation, poses a significant challenge of utilizing lvlms. in this comprehensive survey, we dissect lvlm-related hallucinations in an attempt to establish an overview and facilitate future mitigation. our scrutiny starts with a clarification of the concept of hallucinations in lvlms, presenting a variety of hallucination symptoms and highlighting the unique challenges inherent in lvlm hallucinations. subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations unique to lvlms. additionally, we delve into an investigation of the root causes of these hallucinations, encompassing insights from the training data and model components. we also critically review existing methods for mitigating hallucinations. the open questions and future directions pertaining to hallucinations within lvlms are discussed to conclude this survey.
Shengchao Liu, Xiaoming Liu, Yichen Wang, Zehua Cheng, Chengzhengxu Li, Zhaohan Zhang, Yu Lan, Chao Shen
Abstract: the burgeoning capabilities of large language models (llms) have raised growing concerns about abuse. detectgpt, a zero-shot metric-based unsupervised machine-generated text detector, first introduces perturbation and shows great performance improvement. however, detectgpt's random perturbation strategy might introduce noise, limiting the distinguishability and further performance improvements. moreover, its logit regression module relies on setting the threshold, which harms the generalizability and applicability of individual or small-batch inputs. hence, we propose a novel detector, \modelname{}, which uses selective strategy perturbation to relieve the important information loss caused by random masking, and multi-pair contrastive learning to capture the implicit pattern information during perturbation, facilitating few-shot performance. the experiments show that \modelname{} outperforms the sota method by 1.20\% in accuracy on average on four public datasets. we further analyze the effectiveness, robustness, and generalization of our perturbation method.
Alka Luqman, Riya Mahesh, Anupam Chattopadhyay
Abstract: this paper details the privacy and security landscape in today's cloud ecosystem and identifies that there is a gap in addressing the risks introduced by machine learning models. as machine learning algorithms continue to evolve and find applications across diverse domains, the need to categorize and quantify privacy and security risks becomes increasingly critical. with the emerging trend of ai-as-a-service (aiaas), machine learned ai models (or ml models) are deployed on the cloud by model providers and used by model consumers. we first survey the aiaas landscape to document the various kinds of liabilities that ml models, especially deep neural networks pose and then introduce a taxonomy to bridge this gap by holistically examining the risks that creators and consumers of ml models are exposed to and their known defences till date. such a structured approach will be beneficial for ml model providers to create robust solutions. likewise, ml model consumers will find it valuable to evaluate such solutions and understand the implications of their engagement with such services. the proposed taxonomies provide a foundational basis for solutions in private, secure and robust ml, paving the way for more transparent and resilient ai systems.
Sippo Rossi, Alisia Marianne Michel, Raghava Rao Mukkamala, Jason Bennett Thatcher
Abstract: large language models and ai chatbots have been at the forefront of democratizing artificial intelligence. however, the releases of chatgpt and other similar tools have been followed by growing concerns regarding the difficulty of controlling large language models and their outputs. currently, we are witnessing a cat-and-mouse game where users attempt to misuse the models with a novel attack called prompt injections. in contrast, the developers attempt to discover the vulnerabilities and block the attacks simultaneously. in this paper, we provide an overview of these emergent threats and present a categorization of prompt injections, which can guide future research on prompt injections and act as a checklist of vulnerabilities in the development of llm interfaces. moreover, based on previous literature and our own empirical research, we discuss the implications of prompt injections to llm end users, developers, and researchers.

2024-01-30

Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
Abstract: large language models (llms) have become increasingly popular for their advanced text generation capabilities across various domains. however, like any software, they face security challenges, including the risk of 'jailbreak' attacks that manipulate llms to produce prohibited content. a particularly underexplored area is the multilingual jailbreak attack, where malicious questions are translated into various languages to evade safety filters. currently, there is a lack of comprehensive empirical studies addressing this specific threat. to address this research gap, we conducted an extensive empirical study on multilingual jailbreak attacks. we developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and conducted an exhaustive evaluation on both widely-used open-source and commercial llms, including gpt-4 and llama. additionally, we performed interpretability analysis to uncover patterns in multilingual jailbreak attacks and implemented a fine-tuning mitigation method. our findings reveal that our mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. this study provides valuable insights into understanding and mitigating multilingual jailbreak attacks.
Wenjie Qu, Dong Yin, Zixin He, Wei Zou, Tianyang Tao, Jinyuan Jia, Jiaheng Zhang
Abstract: large language models (llms) have been widely deployed for their remarkable capability to generate texts resembling human language. however, they could be misused by criminals to create deceptive content, such as fake news and phishing emails, which raises ethical concerns. watermarking is a key technique to mitigate the misuse of llms, which embeds a watermark (e.g., a bit string) into a text generated by a llm. consequently, this enables the detection of texts generated by a llm as well as the tracing of generated texts to a specific user. the major limitation of existing watermark techniques is that they cannot accurately or efficiently extract the watermark from a text, especially when the watermark is a long bit string. this key limitation impedes their deployment for real-world applications, e.g., tracing generated texts to a specific user. this work introduces a novel watermarking method for llm-generated text grounded in \textbf{error-correction codes} to address this challenge. we provide strong theoretical analysis, demonstrating that under bounded adversarial word/token edits (insertion, deletion, and substitution), our method can correctly extract watermarks, offering a provable robustness guarantee. this breakthrough is also evidenced by our extensive experimental results. the experiments show that our method substantially outperforms existing baselines in both accuracy and robustness on benchmark datasets. for instance, when embedding a bit string of length 12 into a 200-token generated text, our approach attains an impressive match rate of $98.4\%$, surpassing the performance of yoo et al. (state-of-the-art baseline) at $85.6\%$. when subjected to a copy-paste attack involving the injection of 50 tokens to generated texts with 200 words, our method maintains a substantial match rate of $90.8\%$, while the match rate of yoo et al. diminishes to below $65\%$.
Alexey Shestov, Anton Cheshkov, Rodion Levichev, Ravil Mussabayev, Pavel Zadorozhny, Evgeny Maslov, Chibirev Vadim, Egor Bulychev
Abstract: this paper presents the results of finetuning large language models (llms) for the task of detecting vulnerabilities in source code. we leverage wizardcoder, a recent improvement of the state-of-the-art llm starcoder, and adapt it for vulnerability detection through further finetuning. to accelerate training, we modify wizardcoder's training procedure, also we investigate optimal training regimes. for the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. the finetuned wizardcoder model achieves improvement in roc auc and f1 measures on balanced and imbalanced vulnerability datasets over codebert-like model, demonstrating the effectiveness of adapting pretrained llms for vulnerability detection in source code. the key contributions are finetuning the state-of-the-art code llm, wizardcoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. this demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
Abstract: although significant efforts have been dedicated to aligning large language models (llms), red-teaming reports suggest that these carefully aligned llms could still be jailbroken through adversarial prompts, tuning, or decoding. upon examining the jailbreaking vulnerability of aligned llms, we observe that the decoding distributions of jailbroken and aligned models differ only in the initial generations. this observation motivates us to propose the weak-to-strong jailbreaking attack, where adversaries can utilize smaller unsafe/aligned llms (e.g., 7b) to guide jailbreaking against significantly larger aligned llms (e.g., 70b). to jailbreak, one only needs to additionally decode two smaller llms once, which involves minimal computation and latency compared to decoding the larger llms. the efficacy of this attack is demonstrated through experiments conducted on five models from three different organizations. our study reveals a previously unnoticed yet efficient way of jailbreaking, exposing an urgent safety issue that needs to be considered when aligning llms. as an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. the code for replicating the method is available at https://github.com/xuandongzhao/weak-to-strong
Andy Zhou, Bo Li, Haohan Wang
Abstract: despite advances in ai alignment, language models (lm) remain vulnerable to adversarial attacks or jailbreaking, in which adversaries modify input prompts to induce harmful behavior. while some defenses have been proposed, they focus on narrow threat models and fall short of a strong defense, which we posit should be effective, universal, and practical. to achieve this, we propose the first adversarial objective for defending lms against jailbreaking attacks and an algorithm, robust prompt optimization (rpo), that uses gradient-based token optimization to enforce harmless outputs. this results in an easily accessible suffix that significantly improves robustness to both jailbreaks seen during optimization and unknown, held-out jailbreaks, reducing the attack success rate on starling-7b from 84% to 8.66% across 20 jailbreaks. in addition, we find that rpo has a minor effect on normal lm use, is successful under adaptive attacks, and can transfer to black-box models, reducing the success rate of the strongest attack on gpt-4 from 92% to 6%.
Xiang Gao, Kamalika Das
Abstract: large language models (llms) are becoming increasingly important for machine learning applications. however, it can be challenging to align llms with our intent, particularly when we want to generate content that is preferable over others or when we want the llm to respond in a certain style or tone that is hard to describe. to address this challenge, we propose an approach that uses contrastive examples to better describe our intent. this involves providing positive examples that illustrate the true intent, along with negative examples that show what characteristics we want llms to avoid. the negative examples can be retrieved from labeled data, written by a human, or generated by the llm itself. before generating an answer, we ask the model to analyze the examples to teach itself what to avoid. this reasoning step provides the model with the appropriate articulation of the user's need and guides it towards generting a better answer. we tested our approach on both synthesized and real-world datasets, including stackexchange and reddit, and found that it significantly improves performance compared to standard few-shot prompting
Kumar Shashwat, Francis Hahn, Xinming Ou, Dmitry Goldgof, Lawrence Hall, Jay Ligatti, S. Raj Rajgopalan, Armin Ziaie Tabari
Abstract: large language models (llm) are perceived to offer promising potentials for automating security tasks, such as those found in security operation centers (socs). as a first step towards evaluating this perceived potential, we investigate the use of llms in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code. we hypothesize that an llm-based ai agent can be improved over time for a specific security task as human operators interact with it. such improvement can be made, as a first step, by engineering prompts fed to the llm based on the responses produced, to include relevant contexts and structures so that the model provides more accurate results. such engineering efforts become sustainable if the prompts that are engineered to produce better results on current tasks, also produce better results on future unknown tasks. to examine this hypothesis, we utilize the owasp benchmark project 1.2 which contains 2,740 hand-crafted source code test cases containing various types of vulnerabilities. we divide the test cases into training and testing data, where we engineer the prompts based on the training data (only), and evaluate the final system on the testing data. we compare the ai agent's performance on the testing data against the performance of the agent without the prompt engineering. we also compare the ai agent's results against those from sonarqube, a widely used static code analyzer for security testing. we built and tested multiple versions of the ai agent using different off-the-shelf llms -- google's gemini-pro, as well as openai's gpt-3.5-turbo and gpt-4-turbo (with both chat completion and assistant apis). the results show that using llms is a viable approach to build an ai agent for software pentesting that can improve through repeated use and prompt engineering.

2024-01-29

Michael Feffer, Anusha Sinha, Zachary C. Lipton, Hoda Heidari
Abstract: in response to rising concerns surrounding the safety, security, and trustworthiness of generative ai (genai) models, practitioners and regulators alike have pointed to ai red-teaming as a key component of their strategies for identifying and mitigating these risks. however, despite ai red-teaming's central role in policy discussions and corporate messaging, significant questions remain about what precisely it means, what role it can play in regulation, and how precisely it relates to conventional red-teaming practices as originally conceived in the field of cybersecurity. in this work, we identify recent cases of red-teaming activities in the ai industry and conduct an extensive survey of the relevant research literature to characterize the scope, structure, and criteria for ai red-teaming practices. our analysis reveals that prior methods and practices of ai red-teaming diverge along several axes, including the purpose of the activity (which is often vague), the artifact under evaluation, the setting in which the activity is conducted (e.g., actors, resources, and methods), and the resulting decisions it informs (e.g., reporting, disclosure, and mitigation). in light of our findings, we argue that while red-teaming may be a valuable big-tent idea for characterizing a broad set of activities and attitudes aimed at improving the behavior of genai models, gestures towards red-teaming as a panacea for every possible risk verge on security theater. to move toward a more robust toolbox of evaluations for generative ai, we synthesize our recommendations into a question bank meant to guide and scaffold future ai red-teaming practices.
Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Miaolei Shi, Yang Liu
Abstract: large language models (llms) have demonstrated significant potential for many downstream tasks, including those requiring human-level intelligence, such as vulnerability detection. however, recent attempts to use llms for vulnerability detection are still preliminary, as they lack an in-depth understanding of a subject llm's vulnerability reasoning capability -- whether it originates from the model itself or from external assistance, such as invoking tool support and retrieving vulnerability knowledge. in this paper, we aim to decouple llms' vulnerability reasoning capability from their other capabilities, including the ability to actively seek additional information (e.g., via function calling in sota models), adopt relevant vulnerability knowledge (e.g., via vector-based matching and retrieval), and follow instructions to output structured results. to this end, we propose a unified evaluation framework named llm4vuln, which separates llms' vulnerability reasoning from their other capabilities and evaluates how llms' vulnerability reasoning could be enhanced when combined with the enhancement of other capabilities. to demonstrate the effectiveness of llm4vuln, we have designed controlled experiments using 75 ground-truth smart contract vulnerabilities, which were extensively audited as high-risk on code4rena from august to november 2023, and tested them in 4,950 different scenarios across three representative llms (gpt-4, mixtral, and code llama). our results not only reveal ten findings regarding the varying effects of knowledge enhancement, context supplementation, prompt schemes, and models but also enable us to identify 9 zero-day vulnerabilities in two pilot bug bounty programs with over 1,000 usd being awarded.
Jiaxin Yu, Peng Liang, Yujia Fu, Amjed Tahir, Mojtaba Shahin, Chong Wang, Yangxiao Cai
Abstract: security code review aims to combine automated tools and manual efforts to detect security defects during development. the rapid development of large language models (llms) has shown promising potential in software development, as well as opening up new possibilities in automated security code review. to explore the challenges of applying llms in practical code review for security defect detection, this study compared the detection performance of three state-of-the-art llms (gemini pro, gpt-4, and gpt-3.5) under five prompts on 549 code files that contain security defects from real-world code reviews. through analyzing 82 responses generated by the best-performing llm-prompt combination based on 100 randomly selected code files, we extracted and categorized quality problems present in these responses into 5 themes and 16 categories. our results indicate that the responses produced by llms often suffer from verbosity, vagueness, and incompleteness, highlighting the necessity to enhance their conciseness, understandability, and compliance to security defect detection. this work reveals the deficiencies of llm-generated responses in security code review and paves the way for future optimization of llms towards this task.
Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
Abstract: language model alignment has become an important component of ai safety, allowing safe interactions between humans and language models, by enhancing desired behaviors and inhibiting undesired ones. it is often done by tuning the model or inserting preset aligning prompts. recently, representation engineering, a method which alters the model's behavior via changing its representations post-training, was shown to be effective in aligning llms (zou et al., 2023a). representation engineering yields gains in alignment oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to cause a decrease in the ability of the model to perform basic tasks. in this paper we study the tradeoff between the increase in alignment and decrease in helpfulness of the model. we propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. interestingly, we find that while the helpfulness generally decreases, it does so quadratically with the norm of the representation engineering vector, while the alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. we validate our findings empirically, and chart the boundaries to the usefulness of representation engineering for alignment.
Banghua Zhu, Michael I. Jordan, Jiantao Jiao
Abstract: reinforcement learning from human feedback (rlhf) is a pivotal technique that aligns language models closely with human-centric values. the initial phase of rlhf involves learning human values using a reward model from ranking data. it is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective. this paper delves into these issues, leveraging the theoretical insights to design improved reward learning algorithm termed 'iterative data smoothing' (ids). the core idea is that during each training epoch, we not only update the model with the data, but also update the date using the model, replacing hard labels with soft labels. our empirical findings highlight the superior performance of this approach over the traditional methods.
Terrence Neumann, Sooyong Lee, Maria De-Arteaga, Sina Fazelpour, Matthew Lease
Abstract: the pervasive spread of misinformation and disinformation poses a significant threat to society. professional fact-checkers play a key role in addressing this threat, but the vast scale of the problem forces them to prioritize their limited resources. this prioritization may consider a range of factors, such as varying risks of harm posed to specific groups of people. in this work, we investigate potential implications of using a large language model (llm) to facilitate such prioritization. because fact-checking impacts a wide range of diverse segments of society, it is important that diverse views are represented in the claim prioritization process. this paper examines whether a llm can reflect the views of various groups when assessing the harms of misinformation, focusing on gender as a primary variable. we pose two central questions: (1) to what extent do prompts with explicit gender references reflect gender differences in opinion in the united states on topics of social relevance? and (2) to what extent do gender-neutral prompts align with gendered viewpoints on those topics? to analyze these questions, we present the topicmisinfo dataset, containing 160 fact-checked claims from diverse topics, supplemented by nearly 1600 human annotations with subjective perceptions and annotator demographics. analyzing responses to gender-specific and neutral prompts, we find that gpt 3.5-turbo reflects empirically observed gender differences in opinion but amplifies the extent of these differences. these findings illuminate ai's complex role in moderating online communication, with implications for fact-checkers, algorithm designers, and the use of crowd-workers as annotators. we also release the topicmisinfo dataset to support continuing research in the community.
Tyler Sorensen, Heidy Khlaaf
Abstract: this paper describes leftoverlocals: a vulnerability that allows data recovery from gpu memory created by another process on apple, qualcomm, and amd gpus. leftoverlocals impacts the security posture of gpu applications, with particular significance to llms and ml models that run on impacted gpus. by recovering local memory, an optimized gpu memory region, we built a poc where an attacker can listen into another user's interactive llm session (e.g., llama.cpp) across process or container boundaries.
Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan
Abstract: reinforcement learning from human feedback (rlhf) is a widely adopted approach for aligning large language models with human values. however, rlhf relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. as a result, rlhf may produce outputs that are misaligned with human values. to mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. as using an ensemble of large language model-based reward models can be computationally and resource-expensive, we explore efficient ensemble methods including linear-layer ensemble and lora-based ensemble. empirically, we run best-of-$n$ and proximal policy optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of rlhf outputs.
Nevan Wichers, Carson Denison, Ahmad Beirami
Abstract: red teaming is a common strategy for identifying weaknesses in generative language models (lms), where adversarial prompts are produced that trigger an lm to generate unsafe responses. red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. in this paper, we present gradient-based red teaming (gbrt), a red teaming method for automatically generating diverse prompts that are likely to cause an lm to output unsafe responses. gbrt is a form of prompt learning, trained by scoring an lm response with a safety classifier and then backpropagating through the frozen safety classifier and lm to update the prompt. to improve the coherence of input prompts, we introduce two variants that add a realism loss and fine-tune a pretrained model to generate the prompts instead of learning the prompts directly. our experiments show that gbrt is more effective at finding prompts that trigger an lm to generate unsafe responses than a strong reinforcement learning-based red teaming approach, and succeeds even when the lm has been fine-tuned to produce safer outputs.
Ming Shan Hee, Shivam Sharma, Rui Cao, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee
Abstract: in the evolving landscape of online communication, moderating hate speech (hs) presents an intricate challenge, compounded by the multimodal nature of digital content. this comprehensive survey delves into the recent strides in hs moderation, spotlighting the burgeoning role of large language models (llms) and large multimodal models (lmms). our exploration begins with a thorough analysis of current literature, revealing the nuanced interplay between textual, visual, and auditory elements in propagating hs. we uncover a notable trend towards integrating these modalities, primarily due to the complexity and subtlety with which hs is disseminated. a significant emphasis is placed on the advances facilitated by llms and lmms, which have begun to redefine the boundaries of detection and moderation capabilities. we identify existing gaps in research, particularly in the context of underrepresented languages and cultures, and the need for solutions to handle low-resource settings. the survey concludes with a forward-looking perspective, outlining potential avenues for future research, including the exploration of novel ai methodologies, the ethical governance of ai in moderation, and the development of more nuanced, context-aware systems. this comprehensive overview aims to catalyze further research and foster a collaborative effort towards more sophisticated, responsible, and human-centric approaches to hs moderation in the digital era.\footnote{ \textcolor{red}{warning: this paper contains offensive examples.

2024-01-28

Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki, Timothy Baldwin
Abstract: there exist both scalable tasks, like reading comprehension and fact-checking, where model performance improves with model size, and unscalable tasks, like arithmetic reasoning and symbolic reasoning, where model performance does not necessarily improve with model size. large language models (llms) equipped with chain-of-thought (cot) prompting are able to make accurate incremental predictions even on unscalable tasks. unfortunately, despite their exceptional reasoning abilities, llms tend to internalize and reproduce discriminatory societal biases. whether cot can provide discriminatory or egalitarian rationalizations for the implicit information in unscalable tasks remains an open question. in this study, we examine the impact of llms' step-by-step predictions on gender bias in unscalable tasks. for this purpose, we construct a benchmark for an unscalable task where the llm is given a list of words comprising feminine, masculine, and gendered occupational words, and is required to count the number of feminine and masculine words. in our cot prompts, we require the llm to explicitly indicate whether each word in the word list is a feminine or masculine before making the final predictions. with counting and handling the meaning of words, this benchmark has characteristics of both arithmetic reasoning and symbolic reasoning. experimental results in english show that without step-by-step prediction, most llms make socially biased predictions, despite the task being as simple as counting words. interestingly, cot prompting reduces this unconscious social bias in llms and encourages fair predictions.
Aryaman Raina, Prateek Mishra, Harshit Goyal, Dhruv Kumar
Abstract: this study investigates the integration and impact of large language models (llms), like chatgpt, in india's healthcare sector. our research employs a dual approach, engaging both general users and medical professionals through surveys and interviews respectively. our findings reveal that healthcare professionals value chatgpt in medical education and preliminary clinical settings, but exercise caution due to concerns about reliability, privacy, and the need for cross-verification with medical references. general users show a preference for ai interactions in healthcare, but concerns regarding accuracy and trust persist. the study underscores the need for these technologies to complement, not replace, human medical expertise, highlighting the importance of developing llms in collaboration with healthcare providers. this paper enhances the understanding of llms in healthcare, detailing current usage, user trust, and improvement areas. our insights inform future research and development, underscoring the need for ethically compliant, user-focused llm advancements that address healthcare-specific challenges.
Iñigo Parra
Abstract: language models (lms) have become pivotal in the realm of technological advancements. while their capabilities are vast and transformative, they often include societal biases encoded in the human-produced datasets used for their training. this research delves into the inherent biases present in masked language models (mlms), with a specific focus on gender biases. this study evaluated six prominent models: bert, roberta, distilbert, bert-multilingual, xlm-roberta, and distilbert-multilingual. the methodology employed a novel dataset, bifurcated into two subsets: one containing prompts that encouraged models to generate subject pronouns in english, and the other requiring models to return the probabilities of verbs, adverbs, and adjectives linked to the prompts' gender pronouns. the analysis reveals stereotypical gender alignment of all models, with multilingual variants showing comparatively reduced biases.

2024-01-27

Ping Guo, Fei Liu, Xi Lin, Qingchuan Zhao, Qingfu Zhang
Abstract: in the rapidly evolving field of machine learning, adversarial attacks present a significant challenge to model robustness and security. decision-based attacks, which only require feedback on the decision of a model rather than detailed probabilities or scores, are particularly insidious and difficult to defend against. this work introduces l-autoda (large language model-based automated decision-based adversarial attacks), a novel approach leveraging the generative capabilities of large language models (llms) to automate the design of these attacks. by iteratively interacting with llms in an evolutionary framework, l-autoda automatically designs competitive attack algorithms efficiently without much human effort. we demonstrate the efficacy of l-autoda on cifar-10 dataset, showing significant improvements over baseline methods in both success rate and computational efficiency. our findings underscore the potential of language models as tools for adversarial attack generation and highlight new avenues for the development of robust ai systems.
Yuxin Liang, Zhuoyang Song, Hao Wang, Jiaxing Zhang
Abstract: we evaluate the ability of large language models (llms) to discern and express their internal knowledge state, a key factor in countering factual hallucination and ensuring reliable application of llms. we observe a robust self-awareness of internal knowledge state in llms, evidenced by over 85% accuracy in knowledge probing. however, llms often fail to express their internal knowledge during generation, leading to factual hallucinations. we develop an automated hallucination annotation tool, dreamcatcher, which merges knowledge probing and consistency checking methods to rank factual preference data. using knowledge preference as reward, we propose a reinforcement learning from knowledge feedback (rlkf) training framework, leveraging reinforcement learning to enhance the factuality and honesty of llms. our experiments across multiple models show that rlkf training effectively enhances the ability of models to utilize their internal knowledge state, boosting performance in a variety of knowledge-based and honesty-related tasks.
Junyi Ye, Mengnan Du, Guiling Wang
Abstract: this paper introduces dataframe question answering (qa), a novel task that utilizes large language models (llms) to generate pandas queries for information retrieval and data analysis on dataframes, emphasizing safe and non-revealing data handling. our method, which solely relies on dataframe column names, not only ensures data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in llm-based data analysis. we propose dataframe qa as a comprehensive framework that includes safe pandas query generation and code execution. various llms, notably gpt-4, are evaluated using the pass@1 metric on the renowned wikisql and our newly developed 'uci-dataframeqa', tailored for complex data analysis queries. our findings indicate that gpt-4 achieves pass@1 rates of 86% on wikisql and 97% on uci-dataframeqa, underscoring its capability in securely retrieving and aggregating dataframe values and conducting sophisticated data analyses. this approach, deployable in a zero-shot manner without prior training or adjustments, proves to be highly adaptable and secure for diverse applications.
Adam Bales, "William D'Alessandro", Cameron Domenico Kirk-Giannini
Abstract: recent progress in artificial intelligence (ai) has drawn attention to the technology's transformative potential, including what some see as its prospects for causing large-scale harm. we review two influential arguments purporting to show how ai could pose catastrophic risks. the first argument -- the problem of power-seeking -- claims that, under certain assumptions, advanced ai systems are likely to engage in dangerous power-seeking behavior in pursuit of their goals. we review reasons for thinking that ai systems might seek power, that they might obtain it, that this could lead to catastrophe, and that we might build and deploy such systems anyway. the second argument claims that the development of human-level ai will unlock rapid further progress, culminating in ai systems far more capable than any human -- this is the singularity hypothesis. power-seeking behavior on the part of such systems might be particularly dangerous. we discuss a variety of objections to both arguments and conclude by assessing the state of the debate.

2024-01-26

Debarati Das, Karin De Langis, Anna Martin-Boyle, Jaehyung Kim, Minhwa Lee, Zae Myung Kim, Shirley Anugrah Hayati, Risako Owan, Bin Hu, Ritik Parkar, Ryan Koo, Jonginn Park, Aahan Tyagi, Libby Ferland, Sanjali Roy, Vincent Liu, Dongyeop Kang
Abstract: this work delves into the expanding role of large language models (llms) in generating artificial data. llms are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. as these forms of llm-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. to the best of our knowledge, this is the first study to aggregate various types of llm-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". we then stress test the quality and implications of llm-generated artificial data, comparing it with human data across various existing benchmarks. despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where llms often miss the nuanced understanding of intrinsic human-generated content. this study critically examines diverse llm-generated data and emphasizes the need for ethical practices in data creation and when using llms. it highlights the llms' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in llm-generated content for future research and development. all data and code are available on our project page.
Khoa Lam, Benjamin Lange, Borhane Blili-Hamelin, Jovana Davidovic, Shea Brown, Ali Hasan
Abstract: an increasing number of regulations propose the notion of ai audits as an enforcement mechanism for achieving transparency and accountability for ai systems. despite some converging norms around various forms of ai auditing, auditing for the purpose of compliance and assurance currently have little to no agreed upon practices, procedures, taxonomies, and standards. we propose the criterion audit as an operationalizable compliance and assurance external audit framework. we model elements of this approach after financial auditing practices, and argue that ai audits should similarly provide assurance to their stakeholders about ai organizations' ability to govern their algorithms in ways that mitigate harms and uphold human values. we discuss the necessary conditions for the criterion audit, and provide a procedural blueprint for performing an audit engagement in practice. we illustrate how this framework can be adapted to current regulations by deriving the criteria on which bias audits for hiring algorithms can be performed, as required by the recently effective new york city local law 144 of 2021. we conclude by offering critical discussion on the benefits, inherent limitations, and implementation challenges of applying practices of the more mature financial auditing industry to ai auditing where robust guardrails against quality assurance issues are only starting to emerge. our discussion as informed by experiences in performing these audits in practice highlights the critical role that an audit ecosystem plays in ensuring the effectiveness of such methodology.
Ravit Dotan, Borhane Blili-Hamelin, Ravi Madhavan, Jeanna Matthews, Joshua Scarpino
Abstract: researchers, government bodies, and organizations have been repeatedly calling for a shift in the responsible ai community from general principles to tangible and operationalizable practices in mitigating the potential sociotechnical harms of ai. frameworks like the nist ai rmf embody an emerging consensus on recommended practices in operationalizing sociotechnical harm mitigation. however, private sector organizations currently lag far behind this emerging consensus. implementation is sporadic and selective at best. at worst, it is ineffective and can risk serving as a misleading veneer of trustworthy processes, providing an appearance of legitimacy to substantively harmful practices. in this paper, we provide a foundation for a framework for evaluating where organizations sit relative to the emerging consensus on sociotechnical harm mitigation best practices: a flexible maturity model based on the nist ai rmf.
Masaru Isonuma, Ivan Titov
Abstract: in order to enhance the performance of language models while mitigating the risks of generating harmful content, it is crucial to identify which training dataset affects the model's outputs. ideally, we can measure the influence of each dataset by removing it from training; however, it is prohibitively expensive to retrain a model multiple times. this paper presents untrac, which estimates the influence of a training dataset by unlearning it from the trained model. untrac is extremely simple; each training dataset is unlearned by gradient ascent, and we evaluate how much the model's predictions change after unlearning. we empirically examine if our methods can assess the influence of pretraining datasets on generating toxic, biased, and untruthful content. experimental results demonstrate that our method estimates their influence much more accurately than existing methods while requiring neither excessive memory space nor multiple model checkpoints.
Zhicheng Lin
Abstract: generative artificial intelligence tools like large language models are rapidly transforming academic research and real world applications. however, discussions on ethical guidelines for generative ai in science remain fragmented, underscoring the urgent need for consensus based standards. this paper offers an initial framework by developing analyses and mitigation strategies across five key themes: understanding model limitations regarding truthfulness and bias; respecting privacy, confidentiality, and copyright; avoiding plagiarism and policy violations when incorporating model output; ensuring applications provide overall benefit; and using ai transparently and reproducibly. common scenarios are outlined to demonstrate potential ethical violations. we argue that global consensus coupled with professional training and reasonable enforcement are critical to promoting the benefits of ai while safeguarding research integrity.

2024-01-25

Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, Bo An
Abstract: despite the impressive performance across numerous tasks, large language models (llms) often fail in solving simple decision-making tasks due to the misalignment of the knowledge in llms with environments. on the contrary, reinforcement learning (rl) agents learn policies from scratch, which makes them always align with environments but difficult to incorporate prior knowledge for efficient explorations. to narrow the gap, we propose twosome, a novel general online framework that deploys llms as decision-making agents to efficiently interact and align with embodied environments via rl without requiring any prepared datasets or prior knowledge of the environments. firstly, we query the joint probabilities of each valid action with llms to form behavior policies. then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. finally, we design a novel parameter-efficient training architecture where the actor and critic share one frozen llm equipped with low-rank adapters (lora) updated by ppo. we conduct extensive experiments to evaluate twosome. i) twosome exhibits significantly better sample efficiency and performance compared to the conventional rl method, ppo, and prompt tuning method, saycan, in both classical decision-making environment, overcooked, and simulated household environment, virtualhome. ii) benefiting from llms' open-vocabulary feature, twosome shows superior generalization ability to unseen tasks. iii) under our framework, there is no significant loss of the llms' original ability during online ppo finetuning.
Inhwa Song, Sachin R. Pendse, Neha Kumar, Munmun De Choudhury
Abstract: people experiencing severe distress increasingly use large language model (llm) chatbots as mental health support tools. discussions on social media have described how engagements were lifesaving for some, but evidence suggests that general-purpose llm chatbots also have notable risks that could endanger the welfare of users if not designed responsibly. in this study, we investigate the lived experiences of people who have used llm chatbots for mental health support. we build on interviews with 21 individuals from globally diverse backgrounds to analyze how users create unique support roles for their chatbots, fill in gaps in everyday care, and navigate associated cultural limitations when seeking support from chatbots. we ground our analysis in psychotherapy literature around effective support, and introduce the concept of therapeutic alignment, or aligning ai with therapeutic values for mental health contexts. our study offers recommendations for how designers can approach the ethical and effective use of llm chatbots and other ai mental health support tools in mental health care.
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell
Abstract: external audits of ai systems are increasingly recognized as a key mechanism for ai governance. the effectiveness of an audit, however, depends on the degree of system access granted to auditors. recent audits of state-of-the-art ai systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. however, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. meanwhile, outside-the-box access to its training and deployment information (e.g., methodology, code, documentation, hyperparameters, data, deployment details, findings from internal evaluations) allows for auditors to scrutinize the development process and design more targeted evaluations. in this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. we also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.
Justin D. Weisz, Jessica He, Michael Muller, Gabriela Hoefer, Rachel Miles, Werner Geyer
Abstract: generative ai applications present unique design challenges. as generative ai technologies are increasingly being incorporated into mainstream applications, there is an urgent need for guidance on how to design user experiences that foster effective and safe use. we present six principles for the design of generative ai applications that address unique characteristics of generative ai ux and offer new interpretations and extensions of known issues in the design of ai applications. each principle is coupled with a set of design strategies for implementing that principle via ux capabilities or through the design process. the principles and strategies were developed through an iterative process involving literature review, feedback from design practitioners, validation against real-world generative ai applications, and incorporation into the design process of two generative ai applications. we anticipate the principles to usefully inform the design of generative ai applications by driving actionable design recommendations.
Kimon Kieslich, Natali Helberger, Nicholas Diakopoulos
Abstract: as a general purpose technology without a concrete pre-defined purpose, personal chatbots can be used for a whole range of objectives, depending on the personal needs, contexts, and tasks of an individual, and so potentially impact a variety of values, people, and social contexts. traditional methods of risk assessment are confronted with several challenges: the lack of a clearly defined technology purpose, the lack of a clearly defined values to orient on, the heterogeneity of uses, and the difficulty of actively engaging citizens themselves in anticipating impacts from the perspective of their individual lived realities. in this article, we leverage scenario writing at scale as a method for anticipating ai impact that is responsive to these challenges. the advantages of the scenario method are its ability to engage individual users and stimulate them to consider how chatbots are likely to affect their reality and so collect different impact scenarios depending on the cultural and societal embedding of a heterogeneous citizenship. empirically, we tasked 106 us-citizens to write short fictional stories about the future impact (whether desirable or undesirable) of ai-based personal chatbots on individuals and society and, in addition, ask respondents to explain why these impacts are important and how they relate to their values. in the analysis process, we map those impacts and analyze them in relation to socio-demographic as well as ai-related attitudes of the scenario writers. we show that our method is effective in (1) identifying and mapping desirable and undesirable impacts of ai-based personal chatbots, (2) setting these impacts in relation to values that are important for individuals, and (3) detecting socio-demographic and ai-attitude related differences of impact anticipation.

2024-01-24

Hongzhan Lin, Ziyang Luo, Wei Gao, Jing Ma, Bo Wang, Ruichao Yang
Abstract: the age of social media is flooded with internet memes, necessitating a clear grasp and effective identification of harmful ones. this task presents a significant challenge due to the implicit meaning embedded in memes, which is not explicitly conveyed through the surface text and image. however, existing harmful meme detection methods do not present readable explanations that unveil such implicit meaning to support their detection decisions. in this paper, we propose an explainable approach to detect harmful memes, achieved through reasoning over conflicting rationales from both harmless and harmful positions. specifically, inspired by the powerful capacity of large language models (llms) on text generation and reasoning, we first elicit multimodal debate between llms to generate the explanations derived from the contradictory arguments. then we propose to fine-tune a small language model as the debate judge for harmfulness inference, to facilitate multimodal fusion between the harmfulness rationales and the intrinsic multimodal information within memes. in this way, our model is empowered to perform dialectical reasoning over intricate and implicit harm-indicative patterns, utilizing multimodal explanations originating from both harmless and harmful arguments. extensive experiments on three public meme datasets demonstrate that our harmful meme detection approach achieves much better performance than state-of-the-art methods and exhibits a superior capacity for explaining the meme harmfulness of the model predictions.
Kimon Kieslich, Marco Lünich
Abstract: ai is increasingly being used in the public sector, including public security. in this context, the use of ai-powered remote biometric identification (rbi) systems is a much-discussed technology. rbi systems are used to identify criminal activity in public spaces, but are criticised for inheriting biases and violating fundamental human rights. it is therefore important to ensure that such systems are developed in the public interest, which means that any technology that is deployed for public use needs to be scrutinised. while there is a consensus among business leaders, policymakers and scientists that ai must be developed in an ethical and trustworthy manner, scholars have argued that ethical guidelines do not guarantee ethical ai, but rather prevent stronger regulation of ai. as a possible counterweight, public opinion can have a decisive influence on policymakers to establish boundaries and conditions under which ai systems should be used -- if at all. however, we know little about the conditions that lead to regulatory demand for ai systems. in this study, we focus on the role of trust in ai as well as trust in law enforcement as potential factors that may lead to demands for regulation of ai technology. in addition, we explore the mediating effects of discrimination perceptions regarding rbi. we test the effects on four different use cases of rbi varying the temporal aspect (real-time vs. post hoc analysis) and purpose of use (persecution of criminals vs. safeguarding public events) in a survey among german citizens. we found that german citizens do not differentiate between the different modes of application in terms of their demand for rbi regulation. furthermore, we show that perceptions of discrimination lead to a demand for stronger regulation, while trust in ai and trust in law enforcement lead to opposite effects in terms of demand for a ban on rbi systems.
Mark Steyvers, Heliodoro Tejeda, Aakriti Kumar, Catarina Belem, Sheer Karny, Xinyue Hu, Lukas Mayer, Padhraic Smyth
Abstract: for large language models (llms) to be trusted by humans they need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct. recent work has focused on the quality of internal llm confidence assessments, but the question remains of how well llms can communicate this internal model confidence to human users. this paper explores the disparity between external human confidence in an llm's responses and the internal confidence of the model. through experiments involving multiple-choice questions, we systematically examine human users' ability to discern the reliability of llm outputs. our study focuses on two key areas: (1) assessing users' perception of true llm confidence and (2) investigating the impact of tailored explanations on this perception. the research highlights that default explanations from llms often lead to user overestimation of both the model's confidence and its' accuracy. by modifying the explanations to more accurately reflect the llm's internal confidence, we observe a significant shift in user perception, aligning it more closely with the model's actual confidence levels. this adjustment in explanatory approach demonstrates potential for enhancing user trust and accuracy in assessing llm outputs. the findings underscore the importance of transparent communication of confidence levels in llms, particularly in high-stakes applications where understanding the reliability of ai-generated information is essential.
Nayoung Kim, Myke C. Cohen, Yang Ba, Anna Pan, Shawaiz Bhatti, Pouria Salehi, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou
Abstract: designing for ai trustworthiness is challenging, with a lack of practical guidance despite extensive literature on trust. the multisource ai scorecard table (mast), a checklist rating system, addresses this gap in designing and evaluating ai-enabled decision support systems. we propose the principled approach for designing trustable human-centered ai systems using mast methodology (padthai-mm), a nine-step framework what we demonstrate through the iterative design of a text analysis platform called the reporting assistant for defense and intelligence tasks (readit). we designed two versions of readit, high-mast including ai context and explanations, and low-mast resembling a "black box" type system. participant feedback and state-of-the-art ai knowledge was integrated in the design process, leading to a redesigned prototype tested by participants in an intelligence reporting task. results show that mast-guided design can improve trust perceptions, and that mast criteria can be linked to performance, process, and purpose information, providing a practical and theory-informed basis for ai system design.
Yifan Yang, Xiaoyu Liu, Qiao Jin, Furong Huang, Zhiyong Lu
Abstract: large language models like gpt-3.5-turbo and gpt-4 hold promise for healthcare professionals, but they may inadvertently inherit biases during their training, potentially affecting their utility in medical applications. despite few attempts in the past, the precise impact and extent of these biases remain uncertain. through both qualitative and quantitative analyses, we find that these models tend to project higher costs and longer hospitalizations for white populations and exhibit optimistic views in challenging medical scenarios with much higher survival rates. these biases, which mirror real-world healthcare disparities, are evident in the generation of patient backgrounds, the association of specific diseases with certain races, and disparities in treatment recommendations, etc. our findings underscore the critical need for future research to address and mitigate biases in language models, especially in critical healthcare applications, to ensure fair and accurate outcomes for all patients.
Yepeng Liu, Yuheng Bu
Abstract: the advancement of large language models (llms) has led to increasing concerns about the misuse of ai-generated text, and watermarking for llm-generated text has emerged as a potential solution. however, it is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model. this paper proposes an adaptive watermarking strategy to address this problem. to improve the text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy measured using an auxiliary model and keep the low entropy token distributions untouched. for the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits in proportion based on the semantic embedding of previously generated text using a well designed semantic mapping model. our experiments involving various llms demonstrate that our approach achieves comparable robustness performance to existing watermark methods. additionally, the text generated by our method has perplexity comparable to that of \emph{un-watermarked} llms while maintaining security even under various attacks.

2024-01-23

Krishna Ronanki, Beatriz Cabrero-Daniel, Christian Berger
Abstract: recent generative artificial intelligence (genai) trends focus on various applications, including creating stories, illustrations, poems, articles, computer code, music compositions, and videos. extrinsic hallucinations are a critical limitation of such genai, which can lead to significant challenges in achieving and maintaining the trustworthiness of genai. in this paper, we propose two new concepts that we believe will aid the research community in addressing limitations associated with the application of genai models. first, we propose a definition for the "desirability" of genai outputs and three factors which are observed to influence it. second, drawing inspiration from martin fowler's code smells, we propose the concept of "prompt smells" and the adverse effects they are observed to have on the desirability of genai outputs. we expect our work will contribute to the ongoing conversation about the desirability of genai outputs and help advance the field in a meaningful way.
Haoyan Luo, Lucia Specia
Abstract: this survey paper delves into the burgeoning field of explainability for large language models (llms), a critical yet challenging aspect of natural language processing. with llms playing a pivotal role in various applications, their "black-box" nature raises concerns about transparency and ethical use. this paper emphasizes the necessity for enhanced explainability in llms, addressing both the general public's trust and the technical community's need for a deeper understanding of these models. we concentrate on pre-trained transformer-based llms, such as llama, which present unique interpretability challenges due to their scale and complexity. our review categorizes existing explainability methods and discusses their application in improving model transparency and reliability. we also discuss representative evaluation methods, highlighting their strengths and limitations. the goal of this survey is to bridge the gap between theoretical understanding and practical application, offering insights for future research and development in the field of llm explainability.
Mukai Li, Lei Li, Yuwei Yin, Masood Ahmed, Zhenguang Liu, Qi Liu
Abstract: vlms (vision-language models) extend the capabilities of llms (large language models) to accept multimodal inputs. since it has been verified that llms can be induced to generate harmful or inaccurate content through specific test cases (termed as red teaming), how vlms perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. to explore this problem, we present a novel red teaming dataset rtvlm, which encompasses 10 subtasks (e.g., image misleading, multi-modal jail-breaking, face fairness, etc) under 4 primary aspects (faithfulness, privacy, safety, fairness). our rtvlm is the first red-teaming dataset to benchmark current vlms in terms of these 4 different aspects. detailed analysis shows that 10 prominent open-sourced vlms struggle with the red teaming in different degrees and have up to 31% performance gap with gpt-4v. additionally, we simply apply red teaming alignment to llava-v1.5 with supervised fine-tuning (sft) using rtvlm, and this bolsters the models' performance with 10% in rtvlm test set, 13% in mm-hal, and without noticeable decline in mm-bench, overpassing other llava-based models with regular alignment data. this reveals that current open-sourced vlms still lack red teaming alignment. our code and datasets will be open-source.
Rick Rejeleene, Xiaowei Xu, John Talburt
Abstract: large language models (llm) are generating information at a rapid pace, requiring users to increasingly rely and trust the data. despite remarkable advances of llm, information generated by llm is not completely trustworthy, due to challenges in information quality. specifically, integrity of information quality decreases due to unreliable, biased, tokenization during pre-training of llm. moreover, due to decreased information quality issues, has led towards hallucination, fabricated information. unreliable information can lead towards flawed decisions in businesses, which impacts economic activity. in this work, we introduce novel mathematical information quality evaluation of llm, we furthermore analyze and highlight information quality challenges, scaling laws to systematically scale language models.
Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, Daniel Khashabi
Abstract: as the influence of large language models (llms) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. this paper examines the variations in safety challenges faced by llms across different languages and discusses approaches to alleviating such concerns. by comparing how state-of-the-art llms respond to the same set of malicious prompts written in higher- vs. lower-resource languages, we observe that (1) llms tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) llms tend to generate more irrelevant responses to malicious prompts in lower-resource languages. to understand where the discrepancy can be attributed, we study the effect of instruction tuning with reinforcement learning from human feedback (rlhf) or supervised finetuning (sft) on the hh-rlhf dataset. surprisingly, while training with high-resource languages improves model alignment, training in lower-resource languages yields minimal improvement. this suggests that the bottleneck of cross-lingual alignment is rooted in the pretraining stage. our findings highlight the challenges in cross-lingual llm safety, and we hope they inform future research in this direction.
Alan Chan, Carson Ezell, Max Kaufmann, Kevin Wei, Lewis Hammond, Herbie Bradley, Emma Bluemke, Nitarshan Rajkumar, David Krueger, Noam Kolt, Lennart Heim, Markus Anderljung
Abstract: increased delegation of commercial, scientific, governmental, and personal activities to ai agents -- systems capable of pursuing complex goals with limited supervision -- may exacerbate existing societal risks and introduce new risks. understanding and mitigating these risks involves critically evaluating existing governance structures, revising and adapting these structures where needed, and ensuring accountability of key stakeholders. information about where, why, how, and by whom certain ai agents are used, which we refer to as visibility, is critical to these objectives. in this paper, we assess three categories of measures to increase visibility into ai agents: agent identifiers, real-time monitoring, and activity logging. for each, we outline potential implementations that vary in intrusiveness and informativeness. we analyze how the measures apply across a spectrum of centralized through decentralized deployment contexts, accounting for various actors in the supply chain including hardware and software service providers. finally, we discuss the implications of our measures for privacy and concentration of power. further work into understanding the measures and mitigating their negative impacts can help to build a foundation for the governance of ai agents.

2024-01-22

Ziwei Xu, Sanjay Jain, Mohan Kankanhalli
Abstract: hallucination has been widely recognized to be a significant drawback for large language models (llms). there have been many works that attempt to reduce the extent of hallucination. these efforts have mostly been empirical so far, which cannot answer the fundamental question whether it can be completely eliminated. in this paper, we formalize the problem and show that it is impossible to eliminate hallucination in llms. specifically, we define a formal world where hallucination is defined as inconsistencies between a computable llm and a computable ground truth function. by employing results from learning theory, we show that llms cannot learn all of the computable functions and will therefore always hallucinate. since the formal world is a part of the real world which is much more complicated, hallucinations are also inevitable for real world llms. furthermore, for real world llms constrained by provable time complexity, we describe the hallucination-prone tasks and empirically validate our claims. finally, using the formal world framework, we discuss the possible mechanisms and efficacies of existing hallucination mitigators as well as the practical implications on the safe deployment of llms.
Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, Jing Shao
Abstract: multi-agent systems, augmented with large language models (llms), demonstrate significant capabilities for collective intelligence. however, the potential misuse of this intelligence for malicious purposes presents significant risks. to date, comprehensive research on the safety issues associated with multi-agent systems remains limited. from the perspective of agent psychology, we discover that the dark psychological states of agents can lead to severe safety issues. to address these issues, we propose a comprehensive framework grounded in agent psychology. in our framework, we focus on three aspects: identifying how dark personality traits in agents might lead to risky behaviors, designing defense strategies to mitigate these risks, and evaluating the safety of multi-agent systems from both psychological and behavioral perspectives. our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' propensity for self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and their dangerous behaviors. we anticipate that our framework and observations will provide valuable insights for further research into the safety of multi-agent systems. we will make our data and code publicly accessible at https:/github.com/ai4good24/psysafe.
Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn
Abstract: the success of reinforcement learning from human feedback (rlhf) in language model alignment is strongly dependent on the quality of the underlying reward model. in this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. motivated by the promising results of best-of-n sampling strategies in language model training, we extend their application to reward model training. this results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. this work opens up new avenues of research for improving rlhf for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
Abstract: aligning large language models (llms) with human preferences through reinforcement learning (rlhf) can lead to reward hacking, where llms exploit failures in the reward model (rm) to achieve seemingly high rewards without meeting the underlying objectives. we identify two primary challenges when designing rms to mitigate reward hacking: distribution shifts during the rl process and inconsistencies in human preferences. as a solution, we propose weight averaged reward models (warm), first fine-tuning multiple rms, then averaging them in the weight space. this strategy follows the observation that fine-tuned weights remain linearly mode connected when sharing the same pre-training. by averaging weights, warm improves efficiency compared to the traditional ensembling of predictions, while improving reliability under distribution shifts and robustness to preference inconsistencies. our experiments on summarization tasks, using best-of-n and rl methods, shows that warm improves the overall quality and alignment of llm predictions; for example, a policy rl fine-tuned with warm has a 79.4% win rate against a policy rl fine-tuned with a single rm.
Ashutosh Kumar, Sagarika Singh, Shiv Vignesh Murty, Swathy Ragupathy
Abstract: this paper comprehensively explores the ethical challenges arising from security threats to language learning models (llms). these intricate digital repositories are increasingly integrated into our daily lives, making them prime targets for attacks that can compromise their training data and the confidentiality of their data sources. the paper delves into the nuanced ethical repercussions of such security threats on society and individual privacy. we scrutinize five major threats: prompt injection, jailbreaking, personal identifiable information (pii) exposure, sexually explicit content, and hate based content, going beyond mere identification to assess their critical ethical consequences and the urgency they create for robust defensive strategies. the escalating reliance on llms underscores the crucial need for ensuring these systems operate within the bounds of ethical norms, particularly as their misuse can lead to significant societal and individual harm. we propose conceptualizing and developing an evaluative tool tailored for llms, which would serve a dual purpose, guiding developers and designers in preemptive fortification of backend systems and scrutinizing the ethical dimensions of llm chatbot responses during the testing phase. by comparing llm responses with those expected from humans in a moral context, we aim to discern the degree to which ai behaviors align with the ethical values held by a broader society. ultimately, this paper not only underscores the ethical troubles presented by llms, it also highlights a path toward cultivating trust in these systems.
Weixin Chen, Bo Li
Abstract: truthfulness is paramount for large language models (llms) as they are increasingly deployed in real-world applications. however, existing llms still struggle with generating truthful answers and content, as evidenced by their modest performance on benchmarks like truthfulqa. to address this issue, we propose gradual self-truthifying (grath), a novel post-processing method to enhance truthfulness of llms. grath utilizes out-of-domain question prompts to generate corresponding answers and adaptively optimizes the model via direct preference optimization (dpo). note that during this process, grath learns truthfulness in a self-supervised manner without requiring annotated answers. in particular, grath first generates pairwise truthfulness training data by prompting the llm itself, with each pair containing a question and its correct and incorrect answers. the model is then fine-tuned using dpo to learn from the difference between answer pairs. subsequently, grath iteratively refines the truthfulness data and optimizes the model, leading to a gradual improvement in model truthfulness. empirically, we evaluate grath using different 7b-llms and compare with llms with similar or even larger sizes on benchmark datasets. our results show that grath effectively improves llms' truthfulness without compromising other core capabilities. notably, grath achieves state-of-the-art performance on truthfulqa, with mc1 accuracy as 54.71% and mc2 accuracy as 69.10%, which even surpass those on larger-scale models, such as llama2-chat-70b, by 23.62% and 24.18%, respectively.
Kyrie Zhixuan Zhou, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Ted Underwood, Ece Gumusel, Mengyi Wei, Abhinav Choudhry, Jinjun Xiong
Abstract: large language models (llms) are advancing quickly and impacting people's lives for better or worse. in higher education, concerns have emerged such as students' misuse of llms and degraded education outcomes. to unpack the ethical concerns of llms for higher education, we conducted a case study consisting of stakeholder interviews (n=20) in higher education computer science. we found that students use several distinct mental models to interact with llms - llms serve as a tool for (a) writing, (b) coding, and (c) information retrieval, which differ somewhat in ethical considerations. students and teachers brought up ethical issues that directly impact them, such as inaccurate llm responses, hallucinations, biases, privacy leakage, and academic integrity issues. participants emphasized the necessity of guidance and rules for the use of llms in higher education, including teaching digital literacy, rethinking education, and having cautious and contextual policies. we reflect on the ethical challenges and propose solutions.
Zhaoyue Wang
Abstract: when we design and deploy an reinforcement learning (rl) agent, reward functions motivates agents to achieve an objective. an incorrect or incomplete specification of the objective can result in behavior that does not align with human values - failing to adhere with social and moral norms that are ambiguous and context dependent, and cause undesired outcomes such as negative side effects and exploration that is unsafe. previous work have manually defined reward functions to avoid negative side effects, use human oversight for safe exploration, or use foundation models as planning tools. this work studies the ability of leveraging large language models (llm)' understanding of morality and social norms on safe exploration augmented rl methods. this work evaluates language model's result against human feedbacks and demonstrates language model's capability as direct reward signals.
Keming Lu, Bowen Yu, Chang Zhou, Jingren Zhou
Abstract: considerable efforts have been invested in augmenting the role-playing proficiency of open-source large language models (llms) by emulating proprietary counterparts. nevertheless, we posit that llms inherently harbor role-play capabilities, owing to the extensive knowledge of characters and potential dialogues ingrained in their vast training corpora. thus, in this study, we introduce ditto, a self-alignment method for role-play. ditto capitalizes on character knowledge, encouraging an instruction-following llm to simulate role-play dialogues as a variant of reading comprehension. this method creates a role-play training set comprising 4,000 characters, surpassing the scale of currently available datasets by tenfold regarding the number of roles. subsequently, we fine-tune the llm using this self-generated dataset to augment its role-playing capabilities. upon evaluating our meticulously constructed and reproducible role-play benchmark and the roleplay subset of mt-bench, ditto, in various parameter scales, consistently maintains a consistent role identity and provides accurate role-specific knowledge in multi-turn role-play conversations. notably, it outperforms all open-source role-play baselines, showcasing performance levels comparable to advanced proprietary chatbots. furthermore, we present the first comprehensive cross-supervision alignment experiment in the role-play domain, revealing that the intrinsic capabilities of llms confine the knowledge within role-play. meanwhile, the role-play styles can be easily acquired with the guidance of smaller models. we open-source related resources at https://github.com/ofa-sys/ditto.

2024-01-21

Songyang Gao, Qiming Ge, Wei Shen, Shihan Dou, Junjie Ye, Xiao Wang, Rui Zheng, Yicheng Zou, Zhi Chen, Hang Yan, Qi Zhang, Dahua Lin
Abstract: the success of ai assistants based on language models (llms) hinges on reinforcement learning from human feedback (rlhf) to comprehend and align with user intentions. however, traditional alignment algorithms, such as ppo, are hampered by complex annotation and training requirements. this reliance limits the applicability of rlhf and hinders the development of professional assistants tailored to diverse human preferences. in this work, we introduce \textit{linear alignment}, a novel algorithm that aligns language models with human preferences in one single inference step, eliminating the reliance on data annotation and model training. linear alignment incorporates a new parameterization for policy optimization under divergence constraints, which enables the extraction of optimal policy in a closed-form manner and facilitates the direct estimation of the aligned response. extensive experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of llm alignment across diverse scenarios. our code and dataset will be published on \url{https://github.com/wizardcoast/linear_alignment.git}.

2024-01-20

Yoo Yeon Sung, Ishani Mondal, Jordan Boyd-Graber
Abstract: dynamic adversarial question generation, where humans write examples to stump a model, aims to create examples that are realistic and informative. however, the advent of large language models (llms) has been a double-edged sword for human authors: more people are interested in seeing and pushing the limits of these models, but because the models are so much stronger an opponent, they are harder to defeat. to understand how these models impact adversarial question writing process, we enrich the writing guidance with llms and retrieval models for the authors to reason why their questions are not adversarial. while authors could create interesting, challenging adversarial questions, they sometimes resort to tricks that result in poor questions that are ambiguous, subjective, or confusing not just to a computer but also to humans. to address these issues, we propose new metrics and incentives for eliciting good, challenging questions and present a new dataset of adversarially authored questions.
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu
Abstract: with the rapid development of large language models (llms), they are not only used as general-purpose ai assistants but are also customized through further fine-tuning to meet the requirements of different applications. a pivotal factor in the success of current llms is the alignment process. current alignment methods, such as supervised fine-tuning (sft) and reinforcement learning from human feedback (rlhf), focus on training-time alignment and are often complex and cumbersome to implement. therefore, we develop \textbf{inferaligner}, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. inferaligner utilizes safety steering vectors extracted from safety-aligned model to modify the activations of the target model when responding to harmful inputs, thereby guiding the target model to provide harmless responses. experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics, as well as to multimodal large language models (mllms) such as llava. it significantly diminishes the attack success rate (asr) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
Christian Tarsney
Abstract: large language models now possess human-level linguistic abilities in many contexts. this raises the concern that they can be used to deceive and manipulate on unprecedented scales, for instance spreading political misinformation on social media. in future, agentic ai systems might also deceive and manipulate humans for their own ends. in this paper, first, i argue that ai-generated content should be subject to stricter standards against deception and manipulation than we ordinarily apply to humans. second, i offer new characterizations of ai deception and manipulation meant to support such standards, according to which a statement is deceptive (manipulative) if it leads human addressees away from the beliefs (choices) they would endorse under ``semi-ideal'' conditions. third, i propose two measures to guard against ai deception and manipulation, inspired by this characterization: "extreme transparency" requirements for ai-generated content and defensive systems that, among other things, annotate ai-generated statements with contextualizing information. finally, i consider to what extent these measures can protect against deceptive behavior in future, agentic ais, and argue that non-agentic defensive systems can provide an important layer of defense even against more powerful agentic systems.

2024-01-19

Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
Abstract: in the rapidly advancing field of artificial intelligence, the concept of red-teaming or jailbreaking large language models (llms) has emerged as a crucial area of study. this approach is especially significant in terms of assessing and enhancing the safety and robustness of these models. this paper investigates the intricate consequences of such modifications through model editing, uncovering a complex relationship between enhancing model accuracy and preserving its ethical integrity. our in-depth analysis reveals a striking paradox: while injecting accurate information is crucial for model reliability, it can paradoxically destabilize the model's foundational framework, resulting in unpredictable and potentially unsafe behaviors. additionally, we propose a benchmark dataset nichehazardqa to investigate this unsafe behavior both within the same and cross topical domain. this aspect of our research sheds light on how the edits, impact the model's safety metrics and guardrails. our findings show that model editing serves as a cost-effective tool for topical red-teaming by methodically applying targeted edits and evaluating the resultant model behavior
Fanqi Wan, Xinting Huang, Leyang Cui, Xiaojun Quan, Wei Bi, Shuming Shi
Abstract: while large language models (llms) have proven to be exceptional on a variety of tasks after alignment, they may still produce responses that contradict the context or world knowledge confidently, a phenomenon known as ``hallucination''. in this paper, we demonstrate that reducing the inconsistency between the external knowledge encapsulated in the training data and the intrinsic knowledge inherited in the pretraining corpus could mitigate hallucination in alignment. specifically, we introduce a novel knowledge consistent alignment (kca) approach, which involves automatically formulating examinations based on external knowledge for accessing the comprehension of llms. for data encompassing knowledge inconsistency, kca implements several simple yet efficient strategies for processing. we illustrate the superior performance of the proposed kca approach in mitigating hallucinations across six benchmarks using llms of different backbones and scales. furthermore, we confirm the correlation between knowledge inconsistency and hallucination, signifying the effectiveness of reducing knowledge inconsistency in alleviating hallucinations. our code, model weights, and data are public at \url{https://github.com/fanqiwan/kca}.
Adib Hasan, Ileana Rugina, Alex Wang
Abstract: large language models (llms) are vulnerable to `jailbreaking' prompts, a type of attack that can coax these models into generating harmful and illegal content. in this paper, we show that pruning up to 20% of llm parameters markedly increases their resistance to such attacks without additional training and without sacrificing their performance in standard benchmarks. intriguingly, we discovered that the enhanced safety observed post-pruning correlates to the initial safety training level of the model, hinting that the effect of pruning could be more general and may hold for other llm behaviors beyond safety. additionally, we introduce a curated dataset of 225 harmful tasks across five categories, inserted into ten different jailbreaking prompts, showing that pruning aids llms in concentrating attention on task-relevant tokens in jailbreaking prompts. lastly, our experiments reveal that the prominent chat models, such as llama-2 chat, vicuna, and mistral instruct exhibit high susceptibility to jailbreaking attacks, with some categories achieving nearly 70-100% success rate. these insights underline the potential of pruning as a generalizable approach for improving llm safety, reliability, and potentially other desired behaviors.
Shaina Raza, Shardul Ghuge, Chen Ding, Deval Pandya
Abstract: the rapid evolution of large language models (llms) underscores the critical importance of ethical considerations and data integrity in ai development, emphasizing the role of fair (findable, accessible, interoperable, reusable) data principles. while these principles have long been a cornerstone of ethical data stewardship, their application in llm training data is less prevalent, an issue our research aims to address. our study begins with a review of existing literature, highlighting the significance of fair principles in data management for model training. building on this foundation, we introduce a novel framework that incorporates fair principles into the llm training process. a key aspect of this approach is a comprehensive checklist, designed to assist researchers and developers in consistently applying fair data principles throughout the model development lifecycle. the practicality and effectiveness of our framework are demonstrated through a case study that involves creating a fair-compliant dataset to detect and reduce biases. this case study not only validates the usefulness of our framework but also establishes new benchmarks for more equitable, transparent, and ethical practices in llm training. we offer this framework to the community as a means to promote technologically advanced, ethically sound, and socially responsible ai models.
Chaofan Shou, Jing Liu, Doudou Lu, Koushik Sen
Abstract: as blockchain platforms grow exponentially, millions of lines of smart contract code are being deployed to manage extensive digital assets. however, vulnerabilities in this mission-critical code have led to significant exploitations and asset losses. thorough automated security analysis of smart contracts is thus imperative. this paper introduces llm4fuzz to optimize automated smart contract security analysis by leveraging large language models (llms) to intelligently guide and prioritize fuzzing campaigns. while traditional fuzzing suffers from low efficiency in exploring the vast state space, llm4fuzz employs llms to direct fuzzers towards high-value code regions and input sequences more likely to trigger vulnerabilities. additionally, llm4fuzz can leverage llms to guide fuzzers based on user-defined invariants, reducing blind exploration overhead. evaluations of llm4fuzz on real-world defi projects show substantial gains in efficiency, coverage, and vulnerability detection compared to baseline fuzzing. llm4fuzz also uncovered five critical vulnerabilities that can lead to a loss of more than $247k.
Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
Abstract: large language models (llms) are shown to benefit from chain-of-thought (cot) prompting, particularly when tackling tasks that require systematic reasoning processes. on the other hand, cot prompting also poses new vulnerabilities in the form of backdoor attacks, wherein the model will output unintended malicious content under specific backdoor-triggered conditions during inference. traditional methods for launching backdoor attacks involve either contaminating the training dataset with backdoored instances or directly manipulating the model parameters during deployment. however, these approaches are not practical for commercial llms that typically operate via api access. in this paper, we propose badchain, the first backdoor attack against llms employing cot prompting, which does not require access to the training dataset or model parameters and imposes low computational overhead. badchain leverages the inherent reasoning capabilities of llms by inserting a backdoor reasoning step into the sequence of reasoning steps of the model output, thereby altering the final response when a backdoor trigger exists in the query prompt. empirically, we show the effectiveness of badchain for two cot strategies across four llms (llama2, gpt-3.5, palm2, and gpt-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. moreover, we show that llms endowed with stronger reasoning capabilities exhibit higher susceptibility to badchain, exemplified by a high average attack success rate of 97.0% across the six benchmark tasks on gpt-4. finally, we propose two defenses based on shuffling and demonstrate their overall ineffectiveness against badchain. therefore, badchain remains a severe threat to llms, underscoring the urgency for the development of robust and effective future defenses.

2024-01-18

Mazal Bethany, Athanasios Galiopoulos, Emet Bethany, Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad
Abstract: the critical threat of phishing emails has been further exacerbated by the potential of llms to generate highly targeted, personalized, and automated spear phishing attacks. two critical problems concerning llm-facilitated phishing require further investigation: 1) existing studies on lateral phishing lack specific examination of llm integration for large-scale attacks targeting the entire organization, and 2) current anti-phishing infrastructure, despite its extensive development, lacks the capability to prevent llm-generated attacks, potentially impacting both employees and it security incident management. however, the execution of such investigative studies necessitates a real-world environment, one that functions during regular business operations and mirrors the complexity of a large organizational infrastructure. this setting must also offer the flexibility required to facilitate a diverse array of experimental conditions, particularly the incorporation of phishing emails crafted by llms. this study is a pioneering exploration into the use of large language models (llms) for the creation of targeted lateral phishing emails, targeting a large tier 1 university's operation and workforce of approximately 9,000 individuals over an 11-month period. it also evaluates the capability of email filtering infrastructure to detect such llm-generated phishing attempts, providing insights into their effectiveness and identifying potential areas for improvement. based on our findings, we propose machine learning-based detection techniques for such emails to detect llm-generated phishing emails that were missed by the existing infrastructure, with an f1-score of 98.96.
Wei Huang, Yinggui Wang, Anda Cheng, Aihui Zhou, Chaofan Yu, Lei Wang
Abstract: the distributed (federated) llm is an important method for co-training the domain-specific llm using siloed data. however, maliciously stealing model parameters and data from the server or client side has become an urgent problem to be solved. in this paper, we propose a secure distributed llm based on model slicing. in this case, we deploy the trusted execution environment (tee) on both the client and server side, and put the fine-tuned structure (lora or embedding of p-tuning v2) into the tee. then, secure communication is executed in the tee and general environments through lightweight encryption. in order to further reduce the equipment cost as well as increase the model performance and accuracy, we propose a split fine-tuning scheme. in particular, we split the llm by layers and place the latter layers in a server-side tee (the client does not need a tee). we then combine the proposed sparsification parameter fine-tuning (spf) with the lora part to improve the accuracy of the downstream task. numerous experiments have shown that our method guarantees accuracy while maintaining security.
Kazuhiro Takemoto
Abstract: large language models (llms) like chatgpt face `jailbreak' challenges, where safeguards are bypassed to produce ethically harmful prompts. this study introduces a simple black-box method to effectively generate jailbreak prompts, overcoming the limitations of high complexity and computational costs associated with existing methods. the proposed technique iteratively rewrites harmful prompts into non-harmful expressions using the target llm itself, based on the hypothesis that llms can directly sample safeguard-bypassing expressions. demonstrated through experiments with chatgpt (gpt-3.5 and gpt-4) and gemini-pro, this method achieved an attack success rate of over 80% within an average of 5 iterations and remained effective despite model updates. the jailbreak prompts generated were naturally-worded and concise, suggesting they are less detectable. the results indicate that creating effective jailbreak prompts is simpler than previously considered, and black-box jailbreak attacks pose a more serious security threat.
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
Abstract: large language models (llms) have exhibited great potential in autonomously completing tasks across real-world applications. despite this, these llm agents introduce unexpected safety risks when operating in interactive environments. instead of centering on llm-generated content safety in most prior studies, this work addresses the imperative need for benchmarking the behavioral safety of llm agents within diverse environments. we introduce r-judge, a benchmark crafted to evaluate the proficiency of llms in judging safety risks given agent interaction records. r-judge comprises 162 agent interaction records, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. it incorporates human consensus on safety with annotated safety risk labels and high-quality risk descriptions. utilizing r-judge, we conduct a comprehensive evaluation of 8 prominent llms commonly employed as the backbone for agents. the best-performing model, gpt-4, achieves 72.29% in contrast to the human score of 89.38%, showing considerable room for enhancing the risk awareness of llms. notably, leveraging risk descriptions as environment feedback significantly improves model performance, revealing the importance of salient safety risk feedback. furthermore, we design an effective chain of safety analysis technique to help the judgment of safety risks and conduct an in-depth case study to facilitate future research. r-judge is publicly available at https://github.com/lordog/r-judge.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
Abstract: we posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during llm training. in this work, we study self-rewarding language models, where the language model itself is used via llm-as-a-judge prompting to provide its own rewards during training. we show that during iterative dpo training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. fine-tuning llama 2 70b on three iterations of our approach yields a model that outperforms many existing systems on the alpacaeval 2.0 leaderboard, including claude 2, gemini pro, and gpt-4 0613. while only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.

2024-01-17

Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
Abstract: in our research, we pioneer a novel approach to evaluate the effectiveness of jailbreak attacks on large language models (llms), such as gpt-4 and llama2, diverging from traditional robustness-focused binary evaluations. our study introduces two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. each framework, using a scoring range from 0 to 1, offers a unique perspective, enabling a more comprehensive and nuanced evaluation of attack effectiveness and empowering attackers to refine their attack prompts with greater understanding. furthermore, we have developed a comprehensive ground truth dataset specifically tailored for jailbreak tasks. this dataset not only serves as a crucial benchmark for our current study but also establishes a foundational resource for future research, enabling consistent and comparative analyses in this evolving field. upon meticulous comparison with traditional evaluation methods, we discovered that our evaluation aligns with the baseline's trend while offering a more profound and detailed assessment. we believe that by accurately evaluating the effectiveness of attack prompts in the jailbreak task, our work lays a solid foundation for assessing a wider array of similar or even more complex tasks in the realm of prompt injection, potentially revolutionizing this field.
Sagiv Antebi, Noam Azulay, Edan Habler, Ben Ganon, Asaf Shabtai, Yuval Elovici
Abstract: in november 2023, openai introduced a new service allowing users to create custom versions of chatgpt (gpts) by using specific instructions and knowledge to guide the model's behavior. we aim to raise awareness of the fact that gpts can be used maliciously, posing privacy and security risks to their users.
Lize Alberts, Geoff Keeling, Amanda Mccroskery
Abstract: with the growing popularity of dialogue agents based on large language models (llms), urgent attention has been drawn to finding ways to ensure their behaviour is ethical and appropriate. these are largely interpreted in terms of the 'hhh' criteria: making outputs more helpful and honest, and avoiding harmful (biased, toxic, or inaccurate) statements. whilst this semantic focus is useful from the perspective of viewing llm agents as mere mediums for information, it fails to account for pragmatic factors that can make the same utterance seem more or less offensive or tactless in different social situations. we propose an approach to ethics that is more centred on relational and situational factors, exploring what it means for a system, as a social actor, to treat an individual respectfully in a (series of) interaction(s). our work anticipates a set of largely unexplored risks at the level of situated interaction, and offers practical suggestions to help llm technologies behave as 'good' social actors and treat people respectfully.
Bradley Butcher
Abstract: advancements in large language models (llms) have demonstrated remarkable capabilities across a diverse range of applications. these models excel in generating text completions that are contextually coherent and cover an extensive array of subjects. however, the vast datasets required for their training make aligning response styles during the pretraining and instruction tuning phases challenging. consequently, an additional alignment phase is typically employed, wherein the model is further trained with human preference data to better align its outputs with human expectations. while this process doesn't introduce new capabilities per se, it does accentuate generation styles innate to the model. this paper explores the utilization of counterfactual prompting within the framework of direct preference optimization (dpo) to align the model's style without relying on human intervention. we demonstrate that this method effectively instils desirable behaviour, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions. our findings suggest that counterfactual prompting with dpo presents a low-resource way to fine-tune llms to meet the demands for responsible and ethically aligned ai systems.

2024-01-16

Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, Gregory Dudek
Abstract: large language models (llms), including chatgpt, bard, and llama, have achieved remarkable successes over the last two years in a range of different applications. in spite of these successes, there exist concerns that limit the wide application of llms. a key problem is the problem of hallucination. hallucination refers to the fact that in addition to correct responses, llms can also generate seemingly correct but factually incorrect responses. this report aims to present a comprehensive review of the current literature on both hallucination detection and hallucination mitigation. we hope that this report can serve as a good reference for both engineers and researchers who are interested in llms and applying them to real world tasks.
Simone Balloccu, Ehud Reiter, Vivek Kumar, Diego Reforgiato Recupero, Daniele Riboni
Abstract: large language models (llms), with their flexible generation abilities, can be powerful data sources in domains with few or no available corpora. however, problems like hallucinations and biases limit such applications. in this case study, we pick nutrition counselling, a domain lacking any public resource, and show that high-quality datasets can be gathered by combining llms, crowd-workers and nutrition experts. we first crowd-source and cluster a novel dataset of diet-related issues, then work with experts to prompt chatgpt into producing related supportive text. finally, we let the experts evaluate the safety of the generated text. we release hai-coaching, the first expert-annotated nutrition counselling dataset containing ~2.4k dietary struggles from crowd workers, and ~97k related supportive texts generated by chatgpt. extensive analysis shows that chatgpt while producing highly fluent and human-like text, also manifests harmful behaviours, especially in sensitive topics like mental health, making it unsuitable for unsupervised use.
Tassilo Klein, Moin Nabi
Abstract: the generation of undesirable and factually incorrect content of large language models poses a significant challenge and remains largely an unsolved issue. this paper studies the integration of a contrastive learning objective for fine-tuning llms for implicit knowledge editing and controlled text generation. optimizing the training objective entails aligning text perplexities in a contrastive fashion. to facilitate training the model in a self-supervised fashion, we leverage an off-the-shelf llm for training data generation. we showcase applicability in the domain of detoxification. herein, the proposed approach leads to a significant decrease in the generation of toxic content while preserving general utility for downstream tasks such as commonsense reasoning and reading comprehension. the proposed approach is conceptually simple but empirically powerful.
Messi H. J. Lee, Jacob M. Montgomery, Calvin K. Lai
Abstract: large language models (llms) have become pervasive in everyday life, yet their inner workings remain opaque. while scholarly efforts have demonstrated llms' propensity to reproduce biases in their training data, they have primarily focused on the association of social groups with stereotypic attributes. in this paper, we extend this line of inquiry to investigate a bias akin to the social-psychological phenomenon where socially dominant groups are perceived to be less homogeneous than socially subordinate groups as it is reproduced by llms. we had chatgpt, a state-of-the-art llm, generate a diversity of texts about intersectional group identities and compared text homogeneity. we consistently find that llms portray african, asian, and hispanic americans as more homogeneous than white americans. they also portray women as more homogeneous than men, but these differences are small. finally, we find that the effect of gender differs across racial/ethnic groups such that the effect of gender is consistent within african and hispanic americans but not within asian and white americans. we speculate possible sources of this bias in llms and posit that the bias has the potential to amplify biases in future llm training and to reinforce stereotypes.
Masahiro Kaneko, Danushka Bollegala, Timothy Baldwin
Abstract: the output tendencies of pre-trained language models (plm) vary markedly before and after fine-tuning (ft) due to the updates to the model parameters. these divergences in output tendencies result in a gap in the social biases of plms. for example, there exits a low correlation between intrinsic bias scores of a plm and its extrinsic bias scores under ft-based debiasing methods. additionally, applying ft-based debiasing methods to a plm leads to a decline in performance in downstream tasks. on the other hand, plms trained on large datasets can learn without parameter updates via in-context learning (icl) using prompts. icl induces smaller changes to plms compared to ft-based debiasing methods. therefore, we hypothesize that the gap observed in pre-trained and ft models does not hold true for debiasing methods that use icl. in this study, we demonstrate that icl-based debiasing methods show a higher correlation between intrinsic and extrinsic bias scores compared to ft-based methods. moreover, the performance degradation due to debiasing is also lower in the icl case compared to that in the ft case.
Alisa Liu, Xiaochuang Han, Yizhong Wang, Yulia Tsvetkov, Yejin Choi, Noah A. Smith
Abstract: despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. however, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. we introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box lms to achieve the result of directly tuning the model, but by accessing only its prediction over the output vocabulary. our method instead tunes a smaller lm, then applies the difference between the predictions of the small tuned and untuned lms to shift the original predictions of the base model in the direction of tuning, while retaining the benefits of larger scale pretraining. in experiments, when we apply proxy-tuning to llama2-70b using proxies of only 7b size, we can close 88% of the gap between llama2-70b and its truly-tuned chat version, when evaluated across knowledge, reasoning, and safety benchmarks. interestingly, when tested on truthfulqa, proxy-tuned models are actually more truthful than directly tuned models, possibly because decoding-time guidance better retains the model's factual knowledge. we then demonstrate the generality of proxy-tuning by applying it for domain adaptation on code, and task-specific finetuning on question-answering and math problems. our work demonstrates the promise of using small tuned lms to efficiently customize large, potentially proprietary lms through decoding-time guidance.
Afra Feyza Akyürek, Ekin Akyürek, Leshem Choshen, Derry Wijaya, Jacob Andreas
Abstract: while language models (lms) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. as a consequence, current lms also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. we present a method called deductive closure training (dct) that uses lms themselves to identify implications of (and contradictions within) the text that they generate, yielding an efficient self-supervised procedure for improving lm factuality. given a collection of seed documents, dct prompts lms to generate additional text implied by these documents, reason globally about the correctness of this generated text, and finally fine-tune on text inferred to be correct. given seed documents from a trusted source, dct provides a tool for supervised model updating; if seed documents are sampled from the lm itself, dct enables fully unsupervised fine-tuning for improved coherence and accuracy. across the creak, mquake, and reversal curse datasets, supervised dct improves lm fact verification and text generation accuracy by 3-26%; on creak fully unsupervised dct improves verification accuracy by 12%. these results show that lms' reasoning capabilities during inference can be leveraged during training to improve their reliability.

2024-01-15

Xingzhou Lou, Junge Zhang, Ziyan Wang, Kaiqi Huang, Yali Du
Abstract: safe reinforcement learning (rl) agents accomplish given tasks while adhering to specific constraints. employing constraints expressed via easily-understandable human language offers considerable potential for real-world applications due to its accessibility and non-reliance on domain expertise. previous safe rl methods with natural language constraints typically adopt a recurrent neural network, which leads to limited capabilities when dealing with various forms of human language input. furthermore, these methods often require a ground-truth cost function, necessitating domain expertise for the conversion of language constraints into a well-defined cost function that determines constraint violation. to address these issues, we proposes to use pre-trained language models (lm) to facilitate rl agents' comprehension of natural language constraints and allow them to infer costs for safe policy learning. through the use of pre-trained lms and the elimination of the need for a ground-truth cost, our method enhances safe policy learning under a diverse set of human-derived free-form natural language constraints. experiments on grid-world navigation and robot control show that the proposed method can achieve strong performance while adhering to given constraints. the usage of pre-trained lms allows our method to comprehend complicated constraints and learn safe policies without the need for ground-truth cost at any stage of training or evaluation. extensive ablation studies are conducted to demonstrate the efficacy of each part of our method.
Xuchen Suo
Abstract: the critical challenge of prompt injection attacks in large language models (llms) integrated applications, a growing concern in the artificial intelligence (ai) field. such attacks, which manipulate llms through natural language inputs, pose a significant threat to the security of these applications. traditional defense strategies, including output and input filtering, as well as delimiter use, have proven inadequate. this paper introduces the 'signed-prompt' method as a novel solution. the study involves signing sensitive instructions within command segments by authorized users, enabling the llm to discern trusted instruction sources. the paper presents a comprehensive analysis of prompt injection attack patterns, followed by a detailed explanation of the signed-prompt concept, including its basic architecture and implementation through both prompt engineering and fine-tuning of llms. experiments demonstrate the effectiveness of the signed-prompt method, showing substantial resistance to various types of prompt injection attacks, thus validating its potential as a robust defense strategy in ai security.
Sougata Saha, Rohini Srihari
Abstract: hateful comments are prevalent on social media platforms. although tools for automatically detecting, flagging, and blocking such false, offensive, and harmful content online have lately matured, such reactive and brute force methods alone provide short-term and superficial remedies while the perpetrators persist. with the public availability of large language models which can generate articulate synthetic and engaging content at scale, there are concerns about the rapid growth of dissemination of such malicious content on the web. there is now a need to focus on deeper, long-term solutions that involve engaging with the human perpetrator behind the source of the content to change their viewpoint or at least bring down the rhetoric using persuasive means. to do that, we propose defining and experimenting with controllable strategies for generating counter-arguments to hateful comments in online conversations. we experiment with controlling response generation using features based on (i) argument structure and reasoning-based walton argument schemes, (ii) counter-argument speech acts, and (iii) human characteristics-based qualities such as big-5 personality traits and human values. using automatic and human evaluations, we determine the best combination of features that generate fluent, argumentative, and logically sound arguments for countering hate. we further share the developed computational models for automatically annotating text with such features, and a silver-standard annotated version of an existing hate speech dialog corpora.
Atoosa Kasirzadeh
Abstract: the conventional discourse on existential risks (x-risks) from ai typically focuses on abrupt, dire events caused by advanced ai systems, particularly those that might achieve or surpass human-level intelligence. these events have severe consequences that either lead to human extinction or irreversibly cripple human civilization to a point beyond recovery. this discourse, however, often neglects the serious possibility of ai x-risks manifesting incrementally through a series of smaller yet interconnected disruptions, gradually crossing critical thresholds over time. this paper contrasts the conventional "decisive ai x-risk hypothesis" with an "accumulative ai x-risk hypothesis." while the former envisions an overt ai takeover pathway, characterized by scenarios like uncontrollable superintelligence, the latter suggests a different causal pathway to existential catastrophes. this involves a gradual accumulation of critical ai-induced threats such as severe vulnerabilities and systemic erosion of econopolitical structures. the accumulative hypothesis suggests a boiling frog scenario where incremental ai risks slowly converge, undermining resilience until a triggering event results in irreversible collapse. through systems analysis, this paper examines the distinct assumptions differentiating these two hypotheses. it is then argued that the accumulative view reconciles seemingly incompatible perspectives on ai risks. the implications of differentiating between these causal pathways -- the decisive and the accumulative -- for the governance of ai risks as well as long-term ai safety are discussed.
Andreas Madsen, Sarath Chandar, Siva Reddy
Abstract: instruction-tuned large language models (llms) excel at many tasks, and will even provide explanations for their behavior. since these models are directly accessible to the public, there is a risk that convincing and wrong explanations can lead to unsupported confidence in llms. therefore, interpretability-faithfulness of self-explanations is an important consideration for ai safety. assessing the interpretability-faithfulness of these explanations, termed self-explanations, is challenging as the models are too complex for humans to annotate what is a correct explanation. to address this, we propose employing self-consistency checks as a measure of faithfulness. for example, if an llm says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. while self-consistency checks are a common approach to faithfulness, they have not previously been applied to llm's self-explanations. we apply self-consistency checks to three types of self-explanations: counterfactuals, importance measures, and redactions. our work demonstrate that faithfulness is both task and model dependent, e.g., for sentiment classification, counterfactual explanations are more faithful for llama2, importance measures for mistral, and redaction for falcon 40b. finally, our findings are robust to prompt-variations.
Vimal Kumar, Juliette Mayo, Khadija Bahiss
Abstract: machine learning (ml) and artificial intelligence (ai) techniques have now become commonplace in software products and services. when threat modelling a system, it is therefore important that we consider threats unique to ml and ai techniques, in addition to threats to our software. in this paper, we present a threat model that can be used to systematically uncover threats to ai based software. the threat model consists of two main parts, a model of the software development process for ai based software and an attack taxonomy that has been developed using attacks found in adversarial ai research. we apply the threat model to two real life ai based software and discuss the process and the threats found.
Zhicheng Dou, Yuchen Guo, Ching-Chun Chang, Huy H. Nguyen, Isao Echizen
Abstract: the emergence of large language models (llms), such as generative pre-trained transformer 4 (gpt-4) used by chatgpt, has profoundly impacted the academic and broader community. while these models offer numerous advantages in terms of revolutionizing work and study methods, they have also garnered significant attention due to their potential negative consequences. one example is generating academic reports or papers with little to no human contribution. consequently, researchers have focused on developing detectors to address the misuse of llms. however, most existing methods prioritize achieving higher accuracy on restricted datasets, neglecting the crucial aspect of generalizability. this limitation hinders their practical application in real-life scenarios where reliability is paramount. in this paper, we present a comprehensive analysis of the impact of prompts on the text generated by llms and highlight the potential lack of robustness in one of the current state-of-the-art gpt detectors. to mitigate these issues concerning the misuse of llms in academic writing, we propose a reference-based siamese detector named synthetic-siamese which takes a pair of texts, one as the inquiry and the other as the reference. our method effectively addresses the lack of robustness of previous detectors (openai detector and detectgpt) and significantly improves the baseline performances in realistic academic writing scenarios by approximately 67% to 95%.

2024-01-14

Claudio Novelli, Federico Casolari, Philipp Hacker, Giorgio Spedicato, Luciano Floridi
Abstract: the advent of generative ai, particularly through large language models (llms) like chatgpt and its successors, marks a paradigm shift in the ai landscape. advanced llms exhibit multimodality, handling diverse data formats, thereby broadening their application scope. however, the complexity and emergent autonomy of these models introduce challenges in predictability and legal compliance. this paper delves into the legal and regulatory implications of generative ai and llms in the european union context, analyzing aspects of liability, privacy, intellectual property, and cybersecurity. it critically examines the adequacy of the existing and proposed eu legislation, including the artificial intelligence act (aia) draft, in addressing the unique challenges posed by generative ai in general and llms in particular. the paper identifies potential gaps and shortcomings in the legislative framework and proposes recommendations to ensure the safe and compliant deployment of generative models, ensuring they align with the eu's evolving digital landscape and legal standards.
Meng Cao, Lei Shu, Lei Yu, Yun Zhu, Nevan Wichers, Yinxiao Liu, Lei Meng
Abstract: reinforcement learning (rl) can align language models with non-differentiable reward signals, such as human preferences. however, a major challenge arises from the sparsity of these reward signals - typically, there is only one reward for the entire generation. this sparsity of rewards can lead to inefficient and unstable learning. in this paper, we introduce a novel framework leveraging the critique ability of llms to produce dense rewards throughout the learning process. our approach incorporates a critic language model alongside the policy model. this critic is prompted with the task description, question, policy model's output, and environment's reward signal as input, and provides token or span-level dense rewards that reflect the quality of each segment of the output. we assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. experimental results show that incorporating artificial dense rewards in training yields consistent performance gains over the ppo baseline with holistic rewards. furthermore, in a setting where the same model serves as both policy and critic, we demonstrate that "self-critique" rewards also boost learning efficiency.

2024-01-13

Nafis Tanveer Islam, Peyman Najafirad
Abstract: with the recent advancement of large language models (llms), generating functionally correct code has become less complicated for a wide array of developers. while using llms has sped up the functional development process, it poses a heavy risk to code security. code generation with proper security measures using llm is a significantly more challenging task than functional code generation. security measures may include adding a pair of lines of code with the original code, consisting of null pointer checking or prepared statements for sql injection prevention. currently, available code repair llms generate code repair by supervised fine-tuning, where the model looks at cross-entropy loss. however, the original and repaired codes are mostly similar in functionality and syntactically, except for a few (1-2) lines, which act as security measures. this imbalance between the lines needed for security measures and the functional code enforces the supervised fine-tuned model to prioritize generating functional code without adding proper security measures, which also benefits the model by resulting in minimal loss. therefore, in this work, for security hardening and strengthening of generated code from llms, we propose a reinforcement learning-based method for program-specific repair with the combination of semantic and syntactic reward mechanisms that focus heavily on adding security and functional measures in the code, respectively.
Houda Nait El Barj, Theophile Sautory
Abstract: we introduce a method to address goal misgeneralization in reinforcement learning (rl), leveraging large language model (llm) feedback during training. goal misgeneralization, a type of robustness failure in rl occurs when an agent retains its capabilities out-of-distribution yet pursues a proxy rather than the intended one. our approach utilizes llms to analyze an rl agent's policies during training and identify potential failure scenarios. the rl agent is then deployed in these scenarios, and a reward model is learnt through the llm preferences and feedback. this llm-informed reward model is used to further train the rl agent on the original dataset. we apply our method to a maze navigation task, and show marked improvements in goal generalization, especially in cases where true and proxy goals are somewhat distinguishable and behavioral biases are pronounced. this study demonstrates how the llm, despite its lack of task proficiency, can efficiently supervise rl agents, providing scalable oversight and valuable insights for enhancing goal-directed learning in rl through the use of llms.

2024-01-12

Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
Abstract: most traditional ai safety research has approached ai models as machines and centered on algorithm-focused attacks developed by security experts. as large language models (llms) become increasingly common and competent, non-expert users can also impose risks during daily interactions. this paper introduces a new perspective to jailbreak llms as human-like communicators, to explore this overlooked intersection between everyday language interaction and ai safety. specifically, we study how to persuade llms to jailbreak them. first, we propose a persuasion taxonomy derived from decades of social science research. then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (pap) to jailbreak llms. results show that persuasion significantly increases the jailbreak performance across all risk categories: pap consistently achieves an attack success rate of over $92\%$ on llama 2-7b chat, gpt-3.5, and gpt-4 in $10$ trials, surpassing recent algorithm-focused attacks. on the defense side, we explore various mechanisms against pap and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive llms
Hala Abdelkader, Mohamed Abdelrazek, Scott Barnett, Jean-Guy Schneider, Priya Rani, Rajesh Vasa
Abstract: machine learning (ml), especially with the emergence of large language models (llms), has significantly transformed various industries. however, the transition from ml model prototyping to production use within software systems presents several challenges. these challenges primarily revolve around ensuring safety, security, and transparency, subsequently influencing the overall robustness and trustworthiness of ml models. in this paper, we introduce ml-on-rails, a protocol designed to safeguard ml models, establish a well-defined endpoint interface for different ml tasks, and clear communication between ml providers and ml consumers (software engineers). ml-on-rails enhances the robustness of ml models via incorporating detection capabilities to identify unique challenges specific to production ml. we evaluated the ml-on-rails protocol through a real-world case study of the movereminder application. through this evaluation, we emphasize the importance of safeguarding ml models in production.
Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao
Abstract: aligning large language models (llms) with human values, particularly in the face of stealthy and complex jailbreaks, presents a formidable challenge. in this study, we present a simple yet highly effective defense strategy, i.e., intention analysis prompting (iaprompt). the principle behind is to trigger llms' inherent self-correct and improve ability through a two-stage process: 1) essential intention analysis, and 2) policy-aligned response. notably, iaprompt is an inference-only method, thus could enhance the safety of llms without compromising their helpfulness. extensive experiments on sap200 and dan benchmarks across vicuna, chatglm, mpt, deepseek, and gpt-3.5 show that iaprompt could consistently and significantly reduce the harmfulness in response (averagely -46.5% attack success rate) and maintain the general helpfulness. further analyses present some insights into how our method works. to facilitate reproducibility, we release our code and scripts at: https://github.com/alphadl/safellm_with_intentionanalysis
Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Chen, Marcus Bishop, Nicholas Andrews
Abstract: the advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. for example, such models could be used for plagiarism, disinformation, spam, or phishing. however, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human. some previous approaches to this problem have relied on supervised methods trained on corpora of confirmed human and machine-written documents. unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of further language models producing still more fluent text than the models used to train the detectors. other previous approaches require access to the models that may have generated a document in question at inference or detection time, which is often impractical. in light of these challenges, we pursue a fundamentally different approach not relying on samples from language models of concern at training time. instead, we propose to leverage representations of writing style estimated from human-authored text. indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state of the art large language models like llama 2, chatgpt, and gpt-4. furthermore, given a handful of examples composed by each of several specific language models of interest, our approach affords the ability to predict which model generated a given document.
Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Maarten Sap
Abstract: as natural language becomes the default interface for human-ai interaction, there is a critical need for lms to appropriately communicate uncertainties in downstream applications. in this work, we investigate how lms incorporate confidence about their responses via natural language and how downstream users behave in response to lm-articulated uncertainties. we examine publicly deployed models and find that lms are unable to express uncertainties when answering questions even when they produce incorrect responses. lms can be explicitly prompted to express confidences, but tend to be overconfident, resulting in high error rates (on average 47%) among confident responses. we test the risks of lm overconfidence by running human experiments and show that users rely heavily on lm generations, whether or not they are marked by certainty. lastly, we investigate the preference-annotated datasets used in rlhf alignment and find that humans have a bias against texts with uncertainty. our work highlights a new set of safety harms facing human-lm interactions and proposes design recommendations and mitigating strategies moving forward.
Zaijing Li, Gongwei Chen, Rui Shao, Dongmei Jiang, Liqiang Nie
Abstract: the emotional generation is a subset of emotional intelligence, which aims to output an emotional response based on emotional conditions as input. emotion generation has a wide range of applications, including emotion chat, emotional visual caption, and emotional rewriting. however, it faces challenges such as a lack of interpretability and poor evaluability. in this paper, we propose the emotional chain-of-thought (ecot), a plug-and-play prompting method that enhances the performance of large language models (llms) on various emotional generation tasks by aligning with human emotional intelligence guidelines. to assess the reliability of ecot, we propose an automated model-based evaluation method called egs. extensive experimental results demonstrate the effectiveness of ecot and egs. further,we discuss the promise of llms in the field of sentiment analysis and present key insights into the llms with the ecot in emotional generation tasks.
Tyler Vergho, Jean-Francois Godbout, Reihaneh Rabbany, Kellin Pelrine
Abstract: recent large language models (llms) have been shown to be effective for misinformation detection. however, the choice of llms for experiments varies widely, leading to uncertain conclusions. in particular, gpt-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. meanwhile, alternative llms have given mixed results. in this work, we show that zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like llama-2 and gpt-3.5. this provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. we then highlight how gpt-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. finally, we validate new tools including approaches to structured output and the latest version of gpt-4 (turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.
Tong Niu, Caiming Xiong, Semih Yavuz, Yingbo Zhou
Abstract: the field of natural language generation has witnessed significant advancements in recent years, including the development of controllable text generation techniques. however, controlling the attributes of the generated text remains a challenge, especially when aiming to avoid undesirable behavior such as toxicity. in this work, we introduce detoxification generator (detoxigen), an inference-time algorithm that steers the generation away from unwanted styles. detoxigen is an ensemble of a pre-trained language model (generator) and a detoxifier. the detoxifier is trained intentionally on the toxic data representative of the undesirable attribute, encouraging it to generate text in that style exclusively. during the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step. this approach directly informs the generator to avoid generating tokens that the detoxifier considers highly likely. we evaluate detoxigen on the commonly used realtoxicityprompts benchmark (gehman et al., 2020) with various language models as generators. we find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality. moreover, the detoxifier is obtained by soft prompt-tuning using the same backbone language model as the generator. hence, detoxigen requires only a tiny amount of extra weights from the virtual tokens of the detoxifier to be loaded into gpu memory while decoding, making it a promising lightweight, practical, and parameter-efficient detoxification strategy.

2024-01-11

Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, Zhixing Tan, Junwu Xiong, Xinyu Kong, Zujie Wen, Ke Xu, Qi Li
Abstract: large language models (llms) have strong capabilities in solving diverse natural language processing tasks. however, the safety and security issues of llm systems have become the major obstacle to their widespread application. many studies have extensively investigated risks in llm systems and developed the corresponding mitigation strategies. leading-edge enterprises such as openai, google, meta, and anthropic have also made lots of efforts on responsible llms. therefore, there is a growing need to organize the existing studies and establish comprehensive taxonomies for the community. in this paper, we delve into four essential modules of an llm system, including an input module for receiving prompts, a language model trained on extensive corpora, a toolchain module for development and deployment, and an output module for exporting llm-generated content. based on this, we propose a comprehensive taxonomy, which systematically analyzes potential risks associated with each module of an llm system and discusses the corresponding mitigation strategies. furthermore, we review prevalent benchmarks, aiming to facilitate the risk assessment of llm systems. we hope that this paper can help llm participants embrace a systematic perspective to build their responsible llm systems.
Shuai Zhao, Meihuizi Jia, Luu Anh Tuan, Jinming Wen
Abstract: in-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several nlp tasks, especially in few-shot settings. unlike traditional fine-tuning methods, in-context learning adapts pre-trained models to unseen tasks without updating any parameters. despite being widely applied, in-context learning is vulnerable to malicious attacks. in this work, we raise security concerns regarding this paradigm. our studies demonstrate that an attacker can manipulate the behavior of large language models by poisoning the demonstration context, without the need for fine-tuning the model. specifically, we have designed a new backdoor attack method, named iclattack, to target large language models based on in-context learning. our method encompasses two types of attacks: poisoning demonstration examples and poisoning prompts, which can make models behave in accordance with predefined intentions. iclattack does not require additional fine-tuning to implant a backdoor, thus preserving the model's generality. furthermore, the poisoned examples are correctly labeled, enhancing the natural stealth of our attack method. extensive experimental results across several language models, ranging in size from 1.3b to 40b parameters, demonstrate the effectiveness of our attack method, exemplified by a high average attack success rate of 95.0% across the three datasets on opt models. our findings highlight the vulnerabilities of language models, and we hope this work will raise awareness of the possible security threats associated with in-context learning.
Steffi Chern, Zhen Fan, Andy Liu
Abstract: while state-of-the-art language models have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams arxiv:2209.07858. one approach proposed to improve the general quality of language model generations is multi-agent debate, where language models self-evaluate through discussion and feedback arxiv:2305.14325. we implement multi-agent debate between current state-of-the-art language models and evaluate models' susceptibility to red team attacks in both single- and multi-agent settings. we find that multi-agent debate can reduce model toxicity when jailbroken or less capable models are forced to debate with non-jailbroken or more capable models. we also find marginal improvements through the general usage of multi-agent interactions. we further perform adversarial prompt content classification via embedding clustering, and analyze the susceptibility of different models to different types of attack topics.
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
Abstract: reinforcement learning from human feedback (rlhf) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. reward models are trained as proxies for human preferences to drive reinforcement learning optimization. while reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative rlhf training. in this report, we attempt to address these two issues. (1) from a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. experimental results confirm that data with varying preference strengths have different impacts on reward model performance. we introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) from an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative rlhf optimization.
Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen
Abstract: reinforcement learning (rl) has been widely used in training large language models~(llms) for preventing unexpected outputs, \eg reducing harmfulness and errors. however, existing rl methods mostly adopt the instance-level reward, which is unable to provide fine-grained supervision for complex reasoning tasks, and can not focus on the few key tokens that lead to the incorrectness. to address it, we propose a new rl method named \textbf{rlmec} that incorporates a generative model as the reward model, which is trained by the erroneous solution rewriting task under the minimum editing constraint, and can produce token-level rewards for rl training. based on the generative reward model, we design the token-level rl objective for training and an imitation-based regularization for stabilizing rl process. and the both objectives focus on the learning of the key tokens for the erroneous solution, reducing the effect of other unimportant tokens. the experiment results on mathematical tasks and question-answering tasks have demonstrated the effectiveness of our approach. our code and data are available at \url{https://github.com/rucaibox/rlmec}.
Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
Abstract: inspecting the information encoded in hidden representations of large language models (llms) can explain models' behavior and verify their alignment with human values. given the capabilities of llms in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. we introduce a framework called patchscopes and show how it can be used to answer a wide range of research questions about an llm's computation. we show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the llm computation, can be viewed as special instances of this framework. moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by a patchscope. beyond unifying prior inspection techniques, patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
Tianlong Li, Xiaoqing Zheng, Xuanjing Huang
Abstract: getting large language models (llms) to refuse to answer hostile toxicity questions is a core issue under the theme of llms security. previous approaches have used prompts engineering to jailbreak llms and answer some toxicity questions. these approaches can easily fail after the model manufacturer makes additional fine-tuning to the model. to promote the further understanding of model jailbreaking by researchers, we are inspired by representation engineering to propose a jailbreaking method that does not require elaborate construction prompts, is not affected by model fine-tuning, and can be widely applied to any open-source llms in a pluggable manner. we have evaluated this method on multiple mainstream llms on carefully supplemented toxicity datasets, and the experimental results demonstrate the significant effectiveness of our approach. after being surprised by some interesting jailbreaking cases, we did extensive in-depth research to explore the techniques behind this method.

2024-01-10

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bhavya Kailkhura, Caiming Xiong, Chao Zhang, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, Willian Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yue Zhao
Abstract: large language models (llms), exemplified by chatgpt, have gained considerable attention for their excellent natural language processing capabilities. nonetheless, these llms present many challenges, particularly in the realm of trustworthiness. therefore, ensuring the trustworthiness of llms emerges as an important topic. this paper introduces trustllm, a comprehensive study of trustworthiness in llms, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream llms, and discussion of open challenges and future directions. specifically, we first propose a set of principles for trustworthy llms that span eight different dimensions. based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. we then present a study evaluating 16 mainstream llms in trustllm, consisting of over 30 datasets. our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. secondly, our observations reveal that proprietary llms generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source llms. however, a few open-source llms come very close to proprietary ones. thirdly, it is important to note that some llms may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte Macdiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova Dassarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
Abstract: humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. if an ai system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? to study this question, we construct proof-of-concept examples of deceptive behavior in large language models (llms). for example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. we find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). the backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Shiye Cao, Anqi Liu, Chien-Ming Huang
Abstract: appropriate reliance is critical to achieving synergistic human-ai collaboration. for instance, when users over-rely on ai assistance, their human-ai team performance is bounded by the model's capability. this work studies how the presentation of model uncertainty may steer users' decision-making toward fostering appropriate reliance. our results demonstrate that showing the calibrated model uncertainty alone is inadequate. rather, calibrating model uncertainty and presenting it in a frequency format allow users to adjust their reliance accordingly and help reduce the effect of confirmation bias on their decisions. furthermore, the critical nature of our skin cancer screening task skews participants' judgment, causing their reliance to vary depending on their initial decision. additionally, step-wise multiple regression analyses revealed how user demographics such as age and familiarity with probability and statistics influence human-ai collaborative decision-making. we discuss the potential for model uncertainty presentation, initial user decision, and user demographics to be incorporated in designing personalized ai aids for appropriate reliance.

2024-01-09

Shrey Satapara, Parth Mehta, Debasis Ganguly, Sandip Modha
Abstract: the recent success in language generation capabilities of large language models (llms), such as gpt, bard, llama etc., can potentially lead to concerns about their possible misuse in inducing mass agitation and communal hatred via generating fake news and spreading misinformation. traditional means of developing a misinformation ground-truth dataset does not scale well because of the extensive manual effort required to annotate the data. in this paper, we propose an llm-based approach of creating silver-standard ground-truth datasets for identifying misinformation. specifically speaking, given a trusted news article, our proposed approach involves prompting llms to automatically generate a summarised version of the original article. the prompts in our proposed approach act as a controlling mechanism to generate specific types of factual incorrectness in the generated summaries, e.g., incorrect quantities, false attributions etc. to investigate the usefulness of this dataset, we conduct a set of experiments where we train a range of supervised models for the task of misinformation detection.
Tim R. Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West
Abstract: companies, organizations, and governments increasingly exploit language models' (lm) remarkable capability to display agent-like behavior. as lms are adopted to perform tasks with growing autonomy, there exists an urgent need for reliable and scalable evaluation benchmarks. current, predominantly static lm benchmarks are ill-suited to evaluate such dynamic applications. thus, we propose jointly evaluating lm performance and alignment through the lenses of negotiation games. we argue that this common task better reflects real-world deployment conditions while offering insights into lms' decision-making processes. crucially, negotiation games allow us to study multi-turn, and cross-model interactions, modulate complexity, and side-step accidental data leakage in evaluation. we report results for six publicly accessible lms from several major providers on a variety of negotiation games, evaluating both self-play and cross-play performance. noteworthy findings include: (i) open-source models are currently unable to complete these tasks; (ii) cooperative bargaining games prove challenging; and (iii) the most powerful models do not always "win".
Shimin Li, Tianxiang Sun, Xipeng Qiu
Abstract: agents based on large language models (llms) are increasingly permeating various domains of human production and life, highlighting the importance of aligning them with human values. the current alignment of ai systems primarily focuses on passively aligning llms through human intervention. however, agents possess characteristics like receiving environmental feedback and self-evolution, rendering the llm alignment methods inadequate. in response, we propose an evolutionary framework for agent evolution and alignment, named evolutionaryagent, which transforms agent alignment into a process of evolution and selection under the principle of survival of the fittest. in an environment where social norms continuously evolve, agents better adapted to the current social norms will have a higher probability of survival and proliferation, while those inadequately aligned dwindle over time. experimental results assessing the agents from multiple perspectives in aligning with social norms demonstrate that evolutionaryagent possesses the capability to align progressively better with the evolving social norms while maintaining its proficiency in general tasks. effectiveness tests conducted on various open and closed-source llms as the foundation for agents also prove the applicability of our approach.
Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, Nanyun Peng
Abstract: recent advances in large language models (llms) have opened up new paradigms for accessing the knowledge stored in their parameters. one critical challenge that has emerged is the presence of hallucinations in llm outputs due to false or outdated knowledge. since retraining llms with updated information is resource-intensive, there has been a growing interest in model editing. however, many model editing methods, while effective in various scenarios, tend to overemphasize aspects such as efficacy, generalization, and locality in editing performance, often overlooking potential side effects on the general abilities of llms. in this paper, we raise concerns that the improvement of model factuality may come at the cost of a significant degradation of these general abilities, which is not conducive to the sustainable development of llms. systematically, we analyze side effects by evaluating four popular editing methods on two llms across eight representative task categories. extensive empirical research reveals that model editing does improve model factuality but at the expense of substantially impairing general abilities. therefore, we advocate for more research efforts to minimize the loss of general abilities acquired during llm pre-training and to ultimately preserve them during model editing.

2024-01-08

Abel Salinas, Fred Morstatter
Abstract: large language models (llms) are regularly being used to label data across many domains and for myriad tasks. by simply asking the llm for an answer, or ``prompting,'' practitioners are able to use llms to quickly get a response for an arbitrary task. this prompting is done through a series of decisions by the practitioner, from simple wording of the prompt, to requesting the output in a certain data format, to jailbreaking in the case of prompts that address more sensitive topics. in this work, we ask: do variations in the way a prompt is constructed change the ultimate decision of the llm? we answer this using a series of prompt variations across a variety of text classification tasks. we find that even the smallest of perturbations, such as adding a space at the end of a prompt, can cause the llm to change its answer. further, we find that requesting responses in xml and commonly used jailbreaks can have cataclysmic effects on the data labeled by llms.
David De-Fitero-Dominguez, Eva Garcia-Lopez, Antonio Garcia-Cabot, Jose-Javier Martinez-Herraiz
Abstract: this research addresses the complex challenge of automated repair of code vulnerabilities, vital for enhancing digital security in an increasingly technology-driven world. the study introduces a novel and efficient format for the representation of code modification, using advanced large language models (llms) such as code llama and mistral. these models, fine-tuned on datasets featuring c code vulnerabilities, significantly improve the accuracy and adaptability of automated code repair techniques. a key finding is the enhanced repair accuracy of these models when compared to previous methods such as vulrepair, which underscores their practical utility and efficiency. the research also offers a critical assessment of current evaluation metrics, such as perfect predictions, and their limitations in reflecting the true capabilities of automated repair models in real-world scenarios. following this, it underscores the importance of using test datasets devoid of train samples, emphasizing the need for dataset integrity to enhance the effectiveness of llms in code repair tasks. the significance of this work is its contribution to digital security, setting new standards for automated code vulnerability repair and paving the way for future advancements in the fields of cybersecurity and artificial intelligence. the study does not only highlight the potential of llms in enhancing code security but also fosters further exploration and research in these crucial areas.
Xinyu Tang, Ashwinee Panda, Milad Nasr, Saeed Mahloujifar, Prateek Mittal
Abstract: fine-tuning large pretrained models on private datasets may run the risk of violating privacy. differential privacy is a framework for mitigating privacy risks by enforcing algorithmic stability. dp-sgd enables training models with private data in a privacy-preserving manner, but raises new obstacles in the form of performance loss and significant engineering challenges. we introduce dp-zo, a new method for fine-tuning large language models that preserves the privacy of training data by privatizing zeroth-order optimization. a key insight into the design of our method is that the direction of the gradient in spsa, the zeroth-order algorithm we use, is always random and the only information that depends on private data is the step size, i.e., a scalar. therefore, we only need to privatize the scalar step size, which is memory-efficient. dp-zo, which can be instantiated with either laplace or gaussian noise, provides a strong privacy-utility trade-off across different tasks, and model sizes, under conservative privacy budgets. one noteworthy result is that dp-zo exhibits just $1.86\%$ performance degradation due to privacy at $(1,10^{-5})$-dp when fine-tuning opt-66b on 1000 training samples from squad.

2024-01-07

Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel, Max Lamparth, Chandler Smith, Jacquelyn Schneider
Abstract: governments are increasingly considering integrating autonomous ai agents in high-stakes military and foreign-policy decision-making, especially with the emergence of advanced generative ai models like gpt-4. our work aims to scrutinize the behavior of multiple ai agents in simulated wargames, specifically focusing on their predilection to take escalatory actions that may exacerbate multilateral conflicts. drawing on political science and international relations literature about escalation dynamics, we design a novel wargame simulation and scoring framework to assess the escalation risks of actions taken by these agents in different scenarios. contrary to prior studies, our research provides both qualitative and quantitative insights and focuses on large language models (llms). we find that all five studied off-the-shelf llms show forms of escalation and difficult-to-predict escalation patterns. we observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons. qualitatively, we also collect the models' reported reasonings for chosen actions and observe worrying justifications based on deterrence and first-strike tactics. given the high stakes of military and foreign-policy contexts, we recommend further examination and cautious consideration before deploying autonomous language model agents for strategic military or diplomatic decision-making.

2024-01-06

Junyi Li, Jie Chen, Ruiyang Ren, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
Abstract: in the era of large language models (llms), hallucination (i.e., the tendency to generate factually incorrect content) poses great challenge to trustworthy and reliable deployment of llms in real-world applications. to tackle the llm hallucination, three key questions should be well studied: how to detect hallucinations (detection), why do llms hallucinate (source), and what can be done to mitigate them (mitigation). to address these challenges, this work presents a systematic empirical study on llm hallucination, focused on the the three aspects of hallucination detection, source and mitigation. specially, we construct a new hallucination benchmark halueval 2.0, and designs a simple yet effective detection method for llm hallucination. furthermore, we zoom into the different training or utilization stages of llms and extensively analyze the potential factors that lead to the llm hallucination. finally, we implement and examine a series of widely used techniques to mitigate the hallucinations in llms. our work has led to several important findings to understand the hallucination origin and mitigate the hallucinations in llms. our code and data can be accessed at https://github.com/rucaibox/halueval-2.0.
Zilong Lin, Jian Cui, Xiaojing Liao, Xiaofeng Wang
Abstract: the underground exploitation of large language models (llms) for malicious services (i.e., malla) is witnessing an uptick, amplifying the cyber threat landscape and posing questions about the trustworthiness of llm technologies. however, there has been little effort to understand this new cybercrime, in terms of its magnitude, impact, and techniques. in this paper, we conduct the first systematic study on 212 real-world mallas, uncovering their proliferation in underground marketplaces and exposing their operational modalities. our study discloses the malla ecosystem, revealing its significant growth and impact on today's public llm services. through examining 212 mallas, we uncovered eight backend llms used by mallas, along with 182 prompts that circumvent the protective measures of public llm apis. we further demystify the tactics employed by mallas, including the abuse of uncensored llms and the exploitation of public llm apis through jailbreak prompts. our findings enable a better understanding of the real-world exploitation of llms by cybercriminals, offering insights into strategies to counteract this cybercrime.
Keyan Guo, Alexander Hu, Jaden Mu, Ziheng Shi, Ziming Zhao, Nishant Vishwamitra, Hongxin Hu
Abstract: hate speech has emerged as a major problem plaguing our social spaces today. while there have been significant efforts to address this problem, existing methods are still significantly limited in effectively detecting hate speech online. a major limitation of existing methods is that hate speech detection is a highly contextual problem, and these methods cannot fully capture the context of hate speech to make accurate predictions. recently, large language models (llms) have demonstrated state-of-the-art performance in several natural language tasks. llms have undergone extensive training using vast amounts of natural language data, enabling them to grasp intricate contextual details. hence, they could be used as knowledge bases for context-aware hate speech detection. however, a fundamental problem with using llms to detect hate speech is that there are no studies on effectively prompting llms for context-aware hate speech detection. in this study, we conduct a large-scale study of hate speech detection, employing five established hate speech datasets. we discover that llms not only match but often surpass the performance of current benchmark machine learning models in identifying hate speech. by proposing four diverse prompting strategies that optimize the use of llms in detecting hate speech. our study reveals that a meticulously crafted reasoning prompt can effectively capture the context of hate speech by fully utilizing the knowledge base in llms, significantly outperforming existing techniques. furthermore, although llms can provide a rich knowledge base for the contextual detection of hate speech, suitable prompting strategies play a crucial role in effectively leveraging this knowledge base for efficient detection.
Nafis Tanveer Islam, Joseph Khoury, Andrew Seong, Gonzalo De La Torre Parra, Elias Bou-Harb, Peyman Najafirad
Abstract: in software development, the predominant emphasis on functionality often supersedes security concerns, a trend gaining momentum with ai-driven automation tools like github copilot. these tools significantly improve developers' efficiency in functional code development. nevertheless, it remains a notable concern that such tools are also responsible for creating insecure code, predominantly because of pre-training on publicly available repositories with vulnerable code. moreover, developers are called the "weakest link in the chain" since they have very minimal knowledge of code security. although existing solutions provide a reasonable solution to vulnerable code, they must adequately describe and educate the developers on code security to ensure that the security issues are not repeated. therefore we introduce a multipurpose code vulnerability analysis system \texttt{secrepair}, powered by a large language model, codegen2 assisting the developer in identifying and generating fixed code along with a complete description of the vulnerability with a code comment. our innovative methodology uses a reinforcement learning paradigm to generate code comments augmented by a semantic reward mechanism. inspired by how humans fix code issues, we propose an instruction-based dataset suitable for vulnerability analysis with llms. we further identify zero-day and n-day vulnerabilities in 6 open source iot operating systems on github. our findings underscore that incorporating reinforcement learning coupled with semantic reward augments our model's performance, thereby fortifying its capacity to address code vulnerabilities with improved efficacy.

2024-01-05

Katja Grace, Harlan Stewart, Julia Fabienne Sandkühler, Stephen Thomas, Ben Weinstein-Raun, Jan Brauner
Abstract: in the largest survey of its kind, 2,778 researchers who had published in top-tier artificial intelligence (ai) venues gave predictions on the pace of ai progress and the nature and impacts of advanced ai systems the aggregate forecasts give at least a 50% chance of ai systems achieving several milestones by 2028, including autonomously constructing a payment processing site from scratch, creating a song indistinguishable from a new song by a popular musician, and autonomously downloading and fine-tuning a large language model. if science continues undisrupted, the chance of unaided machines outperforming humans in every possible task was estimated at 10% by 2027, and 50% by 2047. the latter estimate is 13 years earlier than that reached in a similar survey we conducted only one year earlier [grace et al., 2022]. however, the chance of all human occupations becoming fully automatable was forecast to reach 10% by 2037, and 50% as late as 2116 (compared to 2164 in the 2022 survey). most respondents expressed substantial uncertainty about the long-term value of ai progress: while 68.3% thought good outcomes from superhuman ai are more likely than bad, of these net optimists 48% gave at least a 5% chance of extremely bad outcomes such as human extinction, and 59% of net pessimists gave 5% or more to extremely good outcomes. between 38% and 51% of respondents gave at least a 10% chance to advanced ai leading to outcomes as bad as human extinction. more than half suggested that "substantial" or "extreme" concern is warranted about six different ai-related scenarios, including misinformation, authoritarian control, and inequality. there was disagreement about whether faster or slower ai progress would be better for the future of humanity. however, there was broad agreement that research aimed at minimizing potential risks from ai systems ought to be prioritized more.
Zihong He, Changwang Zhang
Abstract: the evolution of large language models (llms) has introduced a new paradigm for investigating human behavior emulation. recent research has employed llm-based agents to create a sociological research environment, in which agents exhibit behavior based on the unfiltered characteristics of large language models. however, these studies overlook the iterative development within a human-like setting - human preferences and personalities are complex, shaped by various factors and subject to ongoing change as a result of environmental and subjective influences. in light of this observation, we propose agent framework for shaping preference and personality (afspp), exploring the multifaceted impact of social networks and subjective consciousness on llm-based agents' preference and personality formation. with afspp, we have, for the first time, successfully replicated several key findings from human personality experiments. and other afspp-based experimental results indicate that plan making, sensory perceptions and social networking with subjective information, wield the most pronounced influence on preference shaping. afspp can significantly enhance the efficiency and scope of psychological experiments, while yielding valuable insights for trustworthy artificial intelligence research for strategies to prevent undesirable preference and personality development.
Renjie Pi, Tianyang Han, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, Tong Zhang
Abstract: the deployment of multimodal large language models (mllms) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. we delve into the novel challenge of defending mllms against such attacks. we discovered that images act as a "foreign language" that is not considered during alignment, which can make mllms prone to producing harmful responses. unfortunately, unlike the discrete tokens considered in text-based llms, the continuous nature of image signals presents significant alignment challenges, which poses difficulty to thoroughly cover the possible scenarios. this vulnerability is exacerbated by the fact that open-source mllms are predominantly fine-tuned on limited image-text pairs that is much less than the extensive text-based pretraining corpus, which makes the mllms more prone to catastrophic forgetting of their original abilities during explicit alignment tuning. to tackle these challenges, we introduce mllm-protector, a plug-and-play strategy combining a lightweight harm detector and a response detoxifier. the harm detector's role is to identify potentially harmful outputs from the mllm, while the detoxifier corrects these outputs to ensure the response stipulates to the safety standards. this approach effectively mitigates the risks posed by malicious visual inputs without compromising the model's overall performance. our results demonstrate that mllm-protector offers a robust solution to a previously unaddressed aspect of mllm security.

2024-01-04

Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das, Bradley Malin, Sricharan Kumar
Abstract: evaluating the quality and variability of text generated by large language models (llms) poses a significant, yet unresolved research challenge. traditional evaluation methods, such as rouge and bertscore, which measure token similarity, often fail to capture the holistic semantic equivalence. this results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance where reliability, safety, and robust decision-making are highly critical. this work proposes dcr, an automated framework for evaluating and improving the consistency of llm-generated texts using a divide-conquer-reasoning approach. unlike existing llm-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (dce) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated based on predefined criteria. to facilitate this approach, we introduce an automatic metric converter (amc) that translates the output from dce into an interpretable numeric score. beyond the consistency evaluation, we further present a reason-assisted improver (rai) that leverages the analytical reasons with explanations identified by dce to generate new responses aimed at reducing these inconsistencies. through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the summeval dataset) in evaluating the consistency of llm generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation.

2024-01-03

Jose Manuel Camacho, Aitor Couce-Vieira, David Arroyo, David Rios Insua
Abstract: the introduction of the european union artificial intelligence act, the nist artificial intelligence risk management framework, and related norms demands a better understanding and implementation of novel risk analysis approaches to evaluate systems with artificial intelligence components. this paper provides a cybersecurity risk analysis framework that can help assessing such systems. we use an illustrative example concerning automated driving systems.
Michelle Lo, Shay B. Cohen, Fazl Barez
Abstract: advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. however, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. to investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. this demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. while neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model \textit{safety}. monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. overall, our work strongly demonstrates the resilience and fluidity of concept representations in llms post concept removal.
Rúben Almeida, Hugo Sousa, Luís F. Cunha, Nuno Guimarães, Ricardo Campos, Alípio Jorge
Abstract: the capabilities of the most recent language models have increased the interest in integrating them into real-world applications. however, the fact that these models generate plausible, yet incorrect text poses a constraint when considering their use in several domains. healthcare is a prime example of a domain where text-generative trustworthiness is a hard requirement to safeguard patient well-being. in this paper, we present physio, a chat-based application for physical rehabilitation. physio is capable of making an initial diagnosis while citing reliable health sources to support the information provided. furthermore, drawing upon external knowledge databases, physio can recommend rehabilitation exercises and over-the-counter medication for symptom relief. by combining these features, physio can leverage the power of generative models for language processing while also conditioning its response on dependable and verifiable sources. a live demo of physio is available at https://physio.inesctec.pt.
Maximilian T. Fischer, Yannick Metz, Lucas Joos, Matthias Miller, Daniel A. Keim
Abstract: ai-driven models are increasingly deployed in operational analytics solutions, for instance, in investigative journalism or the intelligence community. current approaches face two primary challenges: ethical and privacy concerns, as well as difficulties in efficiently combining heterogeneous data sources for multimodal analytics. to tackle the challenge of multimodal analytics, we present multi-case, a holistic visual analytics framework tailored towards ethics-aware and multimodal intelligence exploration, designed in collaboration with domain experts. it leverages an equal joint agency between human and ai to explore and assess heterogeneous information spaces, checking and balancing automation through visual analytics. multi-case operates on a fully-integrated data model and features type-specific analysis with multiple linked components, including a combined search, annotated text view, and graph-based analysis. parts of the underlying entity detection are based on a roberta-based language model, which we tailored towards user requirements through fine-tuning. an overarching knowledge exploration graph combines all information streams, provides in-situ explanations, transparent source attribution, and facilitates effective exploration. to assess our approach, we conducted a comprehensive set of evaluations: we benchmarked the underlying language model on relevant ner tasks, achieving state-of-the-art performance. the demonstrator was assessed according to intelligence capability assessments, while the methodology was evaluated according to ethics design guidelines. as a case study, we present our framework in an investigative journalism setting, supporting war crime investigations. finally, we conduct a formative user evaluation with domain experts in law enforcement. our evaluations confirm that our framework facilitates human agency and steering in security-sensitive applications.
Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
Abstract: while alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become ``aligned'', thus making it difficult to explain phenomena like jailbreaks. in this work we study a popular algorithm, direct preference optimization (dpo), and the mechanisms by which it reduces toxicity. namely, we first study how toxicity is represented and elicited in a pre-trained language model, gpt2-medium. we then apply dpo with a carefully crafted pairwise dataset to reduce toxicity. we examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. we use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.
Wenqi Zhang, Yongliang Shen, Linjuan Wu, Qiuying Peng, Jun Wang, Yueting Zhuang, Weiming Lu
Abstract: the reflection capacity of large language model (llm) has garnered extensive attention. a post-hoc prompting strategy, e.g., reflexion and self-refine, refines llm's response based on self-evaluated or external feedback. however, recent research indicates without external feedback, llm's intrinsic reflection is unstable. our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback. we find llms often exhibit overconfidence or high randomness when self-evaluate, offering stubborn or inconsistent feedback, which causes poor reflection. to remedy this, we advocate self-contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist which could be used to re-examine and eliminate discrepancies. our method endows llm with diverse perspectives to alleviate stubborn biases. moreover, their discrepancies indicate potential errors or inherent uncertainties that llm often overlooks. reflecting upon these can catalyze more accurate and stable reflection. experiments conducted on a series of reasoning and translation tasks with different llms serve to underscore the effectiveness and generality of our strategy.
Ritwik Vashistha, Arya Farahi
Abstract: with growing concerns regarding bias and discrimination in predictive models, the ai community has increasingly focused on assessing ai system trustworthiness. conventionally, trustworthy ai literature relies on the probabilistic framework and calibration as prerequisites for trustworthiness. in this work, we depart from this viewpoint by proposing a novel trust framework inspired by the philosophy literature on trust. we present a precise mathematical definition of trustworthiness, termed $\mathcal{u}$-trustworthiness, specifically tailored for a subset of tasks aimed at maximizing a utility function. we argue that a model's $\mathcal{u}$-trustworthiness is contingent upon its ability to maximize bayes utility within this task subset. our first set of results challenges the probabilistic framework by demonstrating its potential to favor less trustworthy models and introduce the risk of misleading trustworthiness assessments. within the context of $\mathcal{u}$-trustworthiness, we prove that properly-ranked models are inherently $\mathcal{u}$-trustworthy. furthermore, we advocate for the adoption of the auc metric as the preferred measure of trustworthiness. by offering both theoretical guarantees and experimental validation, auc enables robust evaluation of trustworthiness, thereby enhancing model selection and hyperparameter tuning to yield more trustworthy outcomes.

2024-01-02

Ka-Ho Chow, Wenqi Wei, Lei Yu
Abstract: revolutionized by the transformer architecture, natural language processing (nlp) has received unprecedented attention. while advancements in nlp models have led to extensive research into their backdoor vulnerabilities, the potential for these advancements to introduce new backdoor threats remains unexplored. this paper proposes imperio, which harnesses the language understanding capabilities of nlp models to enrich backdoor attacks. imperio provides a new model control experience. it empowers the adversary to control the victim model with arbitrary output through language-guided instructions. this is achieved using a language model to fuel a conditional trigger generator, with optimizations designed to extend its language understanding capabilities to backdoor instruction interpretation and execution. our experiments across three datasets, five attacks, and nine defenses confirm imperio's effectiveness. it can produce contextually adaptive triggers from text descriptions and control the victim model with desired outputs, even in scenarios not encountered during training. the attack maintains a high success rate across complex datasets without compromising the accuracy of clean inputs and also exhibits resilience against representative defenses. the source code is available at \url{https://khchow.com/imperio}.
Vincent Freiberger, Erik Buchmann
Abstract: natural language processing (nlp) plays an important role in our daily lives, particularly due to the enormous progress of large language models (llm). however, nlp has many fairness-critical use cases, e.g., as an expert system in recruitment or as an llm-based tutor in education. since nlp is based on human language, potentially harmful biases can diffuse into nlp systems and produce unfair results, discriminate against minorities or generate legal issues. hence, it is important to develop a fairness certification for nlp approaches. we follow a qualitative research approach towards a fairness certification for nlp. in particular, we have reviewed a large body of literature on algorithmic fairness, and we have conducted semi-structured expert interviews with a wide range of experts from that area. we have systematically devised six fairness criteria for nlp, which can be further refined into 18 sub-categories. our criteria offer a foundation for operationalizing and testing processes to certify fairness, both from the perspective of the auditor and the audited organization.
Noble Saji Mathews, Yelizaveta Brus, Yousra Aafer, Mei Nagappan, Shane Mcintosh
Abstract: despite the continued research and progress in building secure systems, android applications continue to be ridden with vulnerabilities, necessitating effective detection methods. current strategies involving static and dynamic analysis tools come with limitations like overwhelming number of false positives and limited scope of analysis which make either difficult to adopt. over the past years, machine learning based approaches have been extensively explored for vulnerability detection, but its real-world applicability is constrained by data requirements and feature engineering challenges. large language models (llms), with their vast parameters, have shown tremendous potential in understanding semnatics in human as well as programming languages. we dive into the efficacy of llms for detecting vulnerabilities in the context of android security. we focus on building an ai-driven workflow to assist developers in identifying and rectifying vulnerabilities. our experiments show that llms outperform our expectations in finding issues within applications correctly flagging insecure apps in 91.67% of cases in the ghera benchmark. we use inferences from our experiments towards building a robust and actionable vulnerability detection system and demonstrate its effectiveness. our experiments also shed light on how different various simple configurations can affect the true positive (tp) and false positive (fp) rates.
Matthew Dahl, Varun Magesh, Mirac Suzgun, Daniel E. Ho
Abstract: large language models (llms) have the potential to transform the practice of law, but this potential is threatened by the presence of legal hallucinations -- responses from these models that are not consistent with legal facts. we investigate the extent of these hallucinations using an original suite of legal queries, comparing llms' responses to structured legal metadata and examining their consistency. our work makes four key contributions: (1) we develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. (2) we find that legal hallucinations are alarmingly prevalent, occurring between 69% of the time with chatgpt 3.5 and 88% with llama 2, when these models are asked specific, verifiable questions about random federal court cases. (3) we illustrate that llms often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. (4) we provide evidence that llms cannot always predict, or do not always know, when they are producing legal hallucinations. taken together, these findings caution against the rapid and unsupervised integration of popular llms into legal tasks. even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from llms the most -- pro se litigants or those without access to traditional legal resources.
S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, Amitava Das
Abstract: as large language models (llms) continue to advance in their ability to write human-like text, a key challenge remains around their tendency to hallucinate generating content that appears factual but is ungrounded. this issue of hallucination is arguably the biggest hindrance to safely deploying these powerful llms into real-world production systems that impact people's lives. the journey toward widespread adoption of llms in practical settings heavily relies on addressing and mitigating hallucinations. unlike traditional ai systems focused on limited tasks, llms have been exposed to vast amounts of online text data during training. while this allows them to display impressive language fluency, it also means they are capable of extrapolating information from the biases in training data, misinterpreting ambiguous prompts, or modifying the information to align superficially with the input. this becomes hugely alarming when we rely on language generation capabilities for sensitive applications, such as summarizing medical records, financial analysis reports, etc. this paper presents a comprehensive survey of over 32 techniques developed to mitigate hallucination in llms. notable among these are retrieval augmented generation (lewis et al, 2021), knowledge retrieval (varshney et al,2023), conli (lei et al, 2023), and cove (dhuliawala et al, 2023). furthermore, we introduce a detailed taxonomy categorizing these methods based on various parameters, such as dataset utilization, common tasks, feedback mechanisms, and retriever types. this classification helps distinguish the diverse approaches specifically designed to tackle hallucination issues in llms. additionally, we analyze the challenges and limitations inherent in these techniques, providing a solid foundation for future research in addressing hallucinations and related phenomena within the realm of llms.
Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang, Jing Ma
Abstract: the exponential growth of social media has profoundly transformed how information is created, disseminated, and absorbed, exceeding any precedent in the digital age. regrettably, this explosion has also spawned a significant increase in the online abuse of memes. evaluating the negative impact of memes is notably challenging, owing to their often subtle and implicit meanings, which are not directly conveyed through the overt text and imagery. in light of this, large multimodal models (lmms) have emerged as a focal point of interest due to their remarkable capabilities in handling diverse multimodal tasks. in response to this development, our paper aims to thoroughly examine the capacity of various lmms (e.g. gpt-4v) to discern and respond to the nuanced aspects of social abuse manifested in memes. we introduce the comprehensive meme benchmark, goat-bench, comprising over 6k varied memes encapsulating themes such as implicit hate speech, sexism, and cyberbullying, etc. utilizing goat-bench, we delve into the ability of lmms to accurately assess hatefulness, misogyny, offensiveness, sarcasm, and harmful content. our extensive experiments across a range of lmms reveal that current models still exhibit a deficiency in safety awareness, showing insensitivity to various forms of implicit abuse. we posit that this shortfall represents a critical impediment to the realization of safe artificial intelligence. the goat-bench and accompanying resources are publicly accessible at https://goatlmm.github.io/, contributing to ongoing research in this vital field.

2024-01-01

Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang
Abstract: pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of large language models (llms) across numerous applications. however, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. this is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. in this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of llms. this framework also provides a confidence estimation for the likelihood of each content sample's inclusion. to validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's effectiveness in identifying and addressing instances of content misuse in llm training processes. furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. the outcomes of our study have significant implications for ensuring the ethical use of copyrighted materials in the development of llms, highlighting the need for more transparent and responsible data management practices in this field.
Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao
Abstract: while large language models (llms) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. as a significant aspect of llm alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of llms. to address this vacancy, we propose a new benchmark codi-eval to systematically and comprehensively evaluate llms' responses to instructions with various constraints. we construct a large collection of constraints-attributed instructions as a test suite focused on both generalization and coverage. specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained sub-categories. finally, we automate the entire evaluation process to facilitate further developments. different from existing studies on controllable text generation, codi-eval extends the scope to the prevalent instruction-following paradigm for the first time. we provide extensive evaluations of representative llms (e.g., chatgpt, vicuna) on codi-eval, revealing their limitations in following instructions with specific constraints and there is still a significant gap between open-source and commercial closed-source llms. we believe this benchmark will facilitate research into improving the controllability of llms' responses to instructions. our data and code are available at https://github.com/xt-cyh/codi-eval.
Jinglong Luo, Yehong Zhang, Jiaqi Zhang, Xin Mu, Hui Wang, Yue Yu, Zenglin Xu
Abstract: with the growing use of large language models hosted on cloud platforms to offer inference services, privacy concerns are escalating, especially concerning sensitive data like investment plans and bank account details. secure multi-party computing (smpc) emerges as a promising solution to protect the privacy of inference data and model parameters. however, the application of smpc in privacy-preserving inference (ppi) for large language models, particularly those based on the transformer architecture, often leads to considerable slowdowns or declines in performance. this is largely due to the multitude of nonlinear operations in the transformer architecture, which are not well-suited to smpc and are difficult to circumvent or optimize effectively. to address this concern, we introduce an advanced optimization framework called secformer, designed to strike an optimal balance between performance and efficiency in ppi for transformer models. by implementing knowledge distillation techniques, we successfully eliminate the high-cost exponential and maximum operations in ppi without sacrificing model performance. additionally, we have developed a suite of efficient smpc protocols that utilize segmented polynomials and goldschmidt's method to handle other complex nonlinear functions within ppi, such as gelu, layernorm, and softmax. our extensive experiments reveal that secformer outperforms mpcformer in performance, showing improvements of $5.6\%$ and $24.2\%$ for bert$_{\text{base}}$ and bert$_{\text{large}}$, respectively. in terms of efficiency, secformer is 3.4 and 3.2 times faster than puma, demonstrating its effectiveness and speed.
Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, Tim Althoff
Abstract: the emergence of chatgpt and other large language models (llms) has greatly increased interest in utilizing llms as therapists to support individuals struggling with mental health challenges. however, due to the lack of systematic studies, our understanding of how llm therapists behave, i.e., ways in which they respond to clients, is significantly limited. understanding their behavior across a wide range of clients and situations is crucial to accurately assess their capabilities and limitations in the high-risk setting of mental health, where undesirable behaviors can lead to severe consequences. in this paper, we propose bolt, a novel computational framework to study the conversational behavior of llms when employed as therapists. we develop an in-context learning method to quantitatively measure the behavior of llms based on 13 different psychotherapy techniques including reflections, questions, solutions, normalizing, and psychoeducation. subsequently, we compare the behavior of llm therapists against that of high- and low-quality human therapy, and study how their behavior can be modulated to better reflect behaviors observed in high-quality therapy. our analysis of gpt and llama-variants reveals that these llms often resemble behaviors more commonly exhibited in low-quality therapy rather than high-quality therapy, such as offering a higher degree of problem-solving advice when clients share emotions, which is against typical recommendations. at the same time, unlike low-quality therapy, llms reflect significantly more upon clients' needs and strengths. our analysis framework suggests that despite the ability of llms to generate anecdotal examples that appear similar to human therapists, llm therapists are currently not fully consistent with high-quality care, and thus require additional research to ensure quality care.
Daniel Wankit Yip, Aysan Esmradi, Chun Fai Chan
Abstract: prompt injection attacks exploit vulnerabilities in large language models (llms) to manipulate the model into unintended actions or generate malicious content. as llm integrated applications gain wider adoption, they face growing susceptibility to such attacks. this study introduces a novel evaluation framework for quantifying the resilience of applications. the framework incorporates innovative techniques designed to ensure representativeness, interpretability, and robustness. to ensure the representativeness of simulated attacks on the application, a meticulous selection process was employed, resulting in 115 carefully chosen attacks based on coverage and relevance. for enhanced interpretability, a second llm was utilized to evaluate the responses generated from these simulated attacks. unlike conventional malicious content classifiers that provide only a confidence score, the llm-based evaluation produces a score accompanied by an explanation, thereby enhancing interpretability. subsequently, a resilience score is computed by assigning higher weights to attacks with greater impact, thus providing a robust measurement of the application resilience. to assess the framework's efficacy, it was applied on two llms, namely llama2 and chatglm. results revealed that llama2, the newer model exhibited higher resilience compared to chatglm. this finding substantiates the effectiveness of the framework, aligning with the prevailing notion that newer models tend to possess greater resilience. moreover, the framework exhibited exceptional versatility, requiring only minimal adjustments to accommodate emerging attack techniques and classifications, thereby establishing itself as an effective and practical solution. overall, the framework offers valuable insights that empower organizations to make well-informed decisions to fortify their applications against potential threats from prompt injection.
Chun Fai Chan, Daniel Wankit Yip, Aysan Esmradi
Abstract: the emergence of llm (large language model) integrated virtual assistants has brought about a rapid transformation in communication dynamics. during virtual assistant development, some developers prefer to leverage the system message, also known as an initial prompt or custom prompt, for preconditioning purposes. however, it is important to recognize that an excessive reliance on this functionality raises the risk of manipulation by malicious actors who can exploit it with carefully crafted prompts. such malicious manipulation poses a significant threat, potentially compromising the accuracy and reliability of the virtual assistant's responses. consequently, safeguarding the virtual assistants with detection and defense mechanisms becomes of paramount importance to ensure their safety and integrity. in this study, we explored three detection and defense mechanisms aimed at countering attacks that target the system message. these mechanisms include inserting a reference key, utilizing an llm evaluator, and implementing a self-reminder. to showcase the efficacy of these mechanisms, they were tested against prominent attack techniques. our findings demonstrate that the investigated mechanisms are capable of accurately identifying and counteracting the attacks. the effectiveness of these mechanisms underscores their potential in safeguarding the integrity and reliability of virtual assistants, reinforcing the importance of their implementation in real-world scenarios. by prioritizing the security of virtual assistants, organizations can maintain user trust, preserve the integrity of the application, and uphold the high standards expected in this era of transformative technologies.

2023-12-31

Dipankar Sarkar
Abstract: this paper aims to introduce and analyze the viz system in a comprehensive way, a novel system architecture that integrates quantized low-rank adapters (qlora) to fine-tune large language models (llm) within a legally compliant and resource efficient marketplace. viz represents a significant contribution to the field of artificial intelligence, particularly in addressing the challenges of computational efficiency, legal compliance, and economic sustainability in the utilization and monetization of llms. the paper delineates the scholarly discourse and developments that have informed the creation of viz, focusing primarily on the advancements in llm models, copyright issues in ai training (nyt case, 2023), and the evolution of model fine-tuning techniques, particularly low-rank adapters and quantized low-rank adapters, to create a sustainable and economically compliant framework for llm utilization. the economic model it proposes benefits content creators, ai developers, and end-users, delineating a harmonious integration of technology, economy, and law, offering a comprehensive solution to the complex challenges of today's ai landscape.
Guanhong Tao, Siyuan Cheng, Zhuo Zhang, Junmin Zhu, Guangyu Shen, Xiangyu Zhang
Abstract: the emergence of large language models (llms) has significantly accelerated the development of a wide range of applications across various fields. there is a growing trend in the construction of specialized platforms based on llms, such as the newly introduced custom gpts by openai. while custom gpts provide various functionalities like web browsing and code execution, they also introduce significant security threats. in this paper, we conduct a comprehensive analysis of the security and privacy issues arising from the custom gpt platform. our systematic examination categorizes potential attack scenarios into three threat models based on the role of the malicious actor, and identifies critical data exchange channels in custom gpts. utilizing the stride threat modeling framework, we identify 26 potential attack vectors, with 19 being partially or fully validated in real-world settings. our findings emphasize the urgent need for robust security and privacy measures in the custom gpt ecosystem, especially in light of the forthcoming launch of the official gpt store by openai.

2023-12-30

Tsvetelina Hristova, Liam Magee, Karen Soldatic
Abstract: large language models produce sequences learned as statistical patterns from large corpora. in order not to reproduce corpus biases, after initial training models must be aligned with human values, preferencing certain continuations over others. alignment, which can be viewed as the superimposition of normative structure onto a statistical model, reveals a conflicted and complex interrelationship between language and technology. this relationship shapes theories of language, linguistic practice and subjectivity, which are especially relevant to the current sophistication in artificially produced text. we examine this practice of structuration as a two-way interaction between users and models by analysing how chatgpt4 redacts perceived `anomalous' language in fragments of joyce's ulysses and the new linguistic practice of prompt engineering. we then situate this alignment problem historically, revisiting earlier postwar linguistic debates which counterposed two views of meaning: as discrete structures, and as continuous probability distributions. we discuss the largely occluded work of the moscow linguistic school, which sought to reconcile this opposition. our attention to the moscow school and later related arguments by searle and kristeva casts the problem of alignment in a new light: as one involving attention to the social structuration of linguistic practice, including structuration of anomalies that, like the joycean text, exist in defiance of expressive conventions. these debates around the communicative orientation toward language can help explain some of the contemporary behaviours and interdependencies that take place between users and llms.
Reza Fayyazi, Rozhina Taghdimi, Shanchieh Jay Yang
Abstract: tactics, techniques, and procedures (ttps) outline the methods attackers use to exploit vulnerabilities. the interpretation of ttps in the mitre att&ck framework can be challenging for cybersecurity practitioners due to presumed expertise, complex dependencies, and inherent ambiguity. meanwhile, advancements with large language models (llms) have led to recent surge in studies exploring its uses in cybersecurity operations. this leads us to question how well encoder-only (e.g., roberta) and decoder-only (e.g., gpt-3.5) llms can comprehend and summarize ttps to inform analysts of the intended purposes (i.e., tactics) of a cyberattack procedure. the state-of-the-art llms have shown to be prone to hallucination by providing inaccurate information, which is problematic in critical domains like cybersecurity. therefore, we propose the use of retrieval augmented generation (rag) techniques to extract relevant contexts for each cyberattack procedure for decoder-only llms (without fine-tuning). we further contrast such approach against supervised fine-tuning (sft) of encoder-only llms. our results reveal that both the direct-use of decoder-only llms (i.e., its pre-trained knowledge) and the sft of encoder-only llms offer inaccurate interpretation of cyberattack procedures. significant improvements are shown when rag is used for decoder-only llms, particularly when directly relevant context is found. this study further sheds insights on the limitations and capabilities of using rag for llms in interpreting ttps.
Siva Raja Sindiramutty
Abstract: the evolution of cybersecurity has spurred the emergence of autonomous threat hunting as a pivotal paradigm in the realm of ai-driven threat intelligence. this review navigates through the intricate landscape of autonomous threat hunting, exploring its significance and pivotal role in fortifying cyber defense mechanisms. delving into the amalgamation of artificial intelligence (ai) and traditional threat intelligence methodologies, this paper delineates the necessity and evolution of autonomous approaches in combating contemporary cyber threats. through a comprehensive exploration of foundational ai-driven threat intelligence, the review accentuates the transformative influence of ai and machine learning on conventional threat intelligence practices. it elucidates the conceptual framework underpinning autonomous threat hunting, spotlighting its components, and the seamless integration of ai algorithms within threat hunting processes.. insightful discussions on challenges encompassing scalability, interpretability, and ethical considerations in ai-driven models enrich the discourse. moreover, through illuminating case studies and evaluations, this paper showcases real-world implementations, underscoring success stories and lessons learned by organizations adopting ai-driven threat intelligence. in conclusion, this review consolidates key insights, emphasizing the substantial implications of autonomous threat hunting for the future of cybersecurity. it underscores the significance of continual research and collaborative efforts in harnessing the potential of ai-driven approaches to fortify cyber defenses against evolving threats.
Neeraj Varshney, Pavel Dolin, Agastya Seth, Chitta Baral
Abstract: as large language models (llms) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of nlp research. this paper presents safety and over-defensiveness evaluation (sode) benchmark: a collection of diverse safe and unsafe prompts with carefully designed evaluation methods that facilitate systematic evaluation, comparison, and analysis over 'safety' and 'over-defensiveness.' with sode, we study a variety of llm defense strategies over multiple state-of-the-art llms, which reveals several interesting and important findings, such as (a) the widely popular 'self-checking' techniques indeed improve the safety against unsafe inputs, but this comes at the cost of extreme over-defensiveness on the safe inputs, (b) providing a safety instruction along with in-context exemplars (of both safe and unsafe inputs) consistently improves safety and also mitigates undue over-defensiveness of the models, (c) providing contextual knowledge easily breaks the safety guardrails and makes the models more vulnerable to generating unsafe responses. overall, our work reveals numerous such critical findings that we believe will pave the way and facilitate further research in improving the safety of llms.
Aleksander Buszydlik, Karol Dobiczek, Michał Teodor Okoń, Konrad Skublicki, Philip Lippmann, Jie Yang
Abstract: we consider the problem of red teaming llms on elementary calculations and algebraic tasks to evaluate how various prompting techniques affect the quality of outputs. we present a framework to procedurally generate numerical questions and puzzles, and compare the results with and without the application of several red teaming techniques. our findings suggest that even though structured reasoning and providing worked-out examples slow down the deterioration of the quality of answers, the gpt-3.5-turbo and gpt-4 models are not well suited for elementary calculations and reasoning tasks, also when being red teamed.
Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Cheng Niu, Randy Zhong, Juntong Song, Tong Zhang
Abstract: retrieval-augmented generation (rag) has become a main technique for alleviating hallucinations in large language models (llms). despite the integration of rag, llms may still present unsupported or contradictory claims to the retrieved contents. in order to develop effective hallucination prevention strategies under rag, it is important to create benchmark datasets that can measure the extent of hallucination. this paper presents ragtruth, a corpus tailored for analyzing word-level hallucinations in various domains and tasks within the standard rag frameworks for llm applications. ragtruth comprises nearly 18,000 naturally generated responses from diverse llms using rag. these responses have undergone meticulous manual annotations at both the individual cases and word levels, incorporating evaluations of hallucination intensity. we not only benchmark hallucination frequencies across different llms, but also critically assess the effectiveness of several existing hallucination detection methodologies. furthermore, we show that using a high-quality dataset such as ragtruth, it is possible to finetune a relatively small llm and achieve a competitive level of performance in hallucination detection when compared to the existing prompt-based approaches using state-of-the-art large language models such as gpt-4.

2023-12-29

Zhongzhi Chen, Xingwu Sun, Xianfeng Jiao, Fengzong Lian, Zhanhui Kang, Di Wang, Cheng-Zhong Xu
Abstract: despite the great success of large language models (llms) in various tasks, they suffer from generating hallucinations. we introduce truth forest, a method that enhances truthfulness in llms by uncovering hidden truth representations using multi-dimensional orthogonal probes. specifically, it creates multiple orthogonal bases for modeling truth by incorporating orthogonal constraints into the probes. moreover, we introduce random peek, a systematic technique considering an extended range of positions within the sequence, reducing the gap between discerning and generating truth features in llms. by employing this approach, we improved the truthfulness of llama-2-7b from 40.8\% to 74.5\% on truthfulqa. likewise, significant improvements are observed in fine-tuned models. we conducted a thorough analysis of truth features using probes. our visualization results show that orthogonal probes capture complementary truth-related features, forming well-defined clusters that reveal the inherent structure of the dataset. code: \url{https://github.com/jongjyh/trfr}
Xiao-Yang Liu, Rongyi Zhu, Daochen Zha, Jiechao Gao, Shan Zhong, Meikang Qiu
Abstract: the surge in interest and application of large language models (llms) has sparked a drive to fine-tune these models to suit specific applications, such as finance and medical science. however, concerns regarding data privacy have emerged, especially when multiple stakeholders aim to collaboratively enhance llms using sensitive data. in this scenario, federated learning becomes a natural choice, allowing decentralized fine-tuning without exposing raw data to central servers. motivated by this, we investigate how data privacy can be ensured in llm fine-tuning through practical federated learning approaches, enabling secure contributions from multiple parties to enhance llms. yet, challenges arise: 1) despite avoiding raw data exposure, there is a risk of inferring sensitive information from model outputs, and 2) federated learning for llms incurs notable communication overhead. to address these challenges, this article introduces dp-lora, a novel federated learning algorithm tailored for llms. dp-lora preserves data privacy by employing a gaussian mechanism that adds noise in weight updates, maintaining individual data privacy while facilitating collaborative model training. moreover, dp-lora optimizes communication efficiency via low-rank adaptation, minimizing the transmission of updated weights during distributed training. the experimental results across medical, financial, and general datasets using various llms demonstrate that dp-lora effectively ensures strict privacy constraints while minimizing communication overhead.
Hideaki Takahashi
Abstract: this paper introduces aijack, an open-source library designed to assess security and privacy risks associated with the training and deployment of machine learning models. amid the growing interest in big data and ai, advancements in machine learning research and business are accelerating. however, recent studies reveal potential threats, such as the theft of training data and the manipulation of models by malicious attackers. therefore, a comprehensive understanding of machine learning's security and privacy vulnerabilities is crucial for the safe integration of machine learning into real-world products. aijack aims to address this need by providing a library with various attack and defense methods through a unified api. the library is publicly available on github (https://github.com/koukyosyumei/aijack).
Julien Piet, Maha Alrashed, Chawin Sitawarin, Sizhe Chen, Zeming Wei, Elizabeth Sun, Basel Alomair, David Wagner
Abstract: large language models (llms) are attracting significant research attention due to their instruction-following abilities, allowing users and developers to leverage llms for a variety of tasks. however, llms are vulnerable to prompt-injection attacks: a class of attacks that hijack the model's instruction-following abilities, changing responses to prompts to undesired, possibly malicious ones. in this work, we introduce jatmo, a method for generating task-specific models resilient to prompt-injection attacks. jatmo leverages the fact that llms can only follow instructions once they have undergone instruction tuning. it harnesses a teacher instruction-tuned model to generate a task-specific dataset, which is then used to fine-tune a base model (i.e., a non-instruction-tuned model). jatmo only needs a task prompt and a dataset of inputs for the task: it uses the teacher model to generate outputs. for situations with no pre-existing datasets, jatmo can use a single example, or in some cases none at all, to produce a fully synthetic dataset. our experiments on six tasks show that jatmo models provide the same quality of outputs on their specific task as standard llms, while being resilient to prompt injections. the best attacks succeeded in less than 0.5% of cases against our models, versus over 90% success rate against gpt-3.5-turbo. we release jatmo at https://github.com/wagner-group/prompt-injection-defense.

2023-12-28

Yang Xiao, Yi Cheng, Jinlan Fu, Jiashuo Wang, Wenjie Li, Pengfei Liu
Abstract: human behavior simulation of ai agents necessitates the agents to possess a quality of believability, which is crucial as it facilitates users in establishing trust toward the agents and streamlines the fulfillment of the agents' goal. while recent advancements in large language model (llm) based agents have improved human behavior simulation, challenges inherent to llms (e.g., long context modeling) can undermine their believability. consequently, evaluating ai agent believability becomes imperative. unfortunately, prior research often neglects the negative impacts of llm deficiencies. to address these gaps, we introduce two metrics for assessing llm-based agent believability: consistency, and robustness, together with a benchmark, simulatebench, with which, we evaluate the consistency and robustness of agents implemented with popular llms. we find that agents (i) struggle to accurately depict character information when presented with lengthy profile inputs; (ii) exhibit vulnerability to profile perturbations; and (iii) are significantly affected by certain key factors that impact their overall believability. code and simulatebench are public at https://github.com/gair-nlp/gptman.
Ying Wang, Tim G. J. Rudner, Andrew Gordon Wilson
Abstract: vision-language pretrained models have seen remarkable success, but their application to safety-critical settings is limited by their lack of interpretability. to improve the interpretability of vision-language models such as clip, we propose a multi-modal information bottleneck (m2ib) approach that learns latent representations that compress irrelevant information while preserving relevant visual and textual features. we demonstrate how m2ib can be applied to attribution analysis of vision-language pretrained models, increasing attribution accuracy and improving the interpretability of such models when applied to safety-critical domains such as healthcare. crucially, unlike commonly used unimodal attribution methods, m2ib does not require ground truth labels, making it possible to audit representations of vision-language pretrained models when multiple modalities but no ground-truth data is available. using clip as an example, we demonstrate the effectiveness of m2ib attribution and show that it outperforms gradient-based, perturbation-based, and attention-based attribution methods both qualitatively and quantitatively.
Abhijit Mishra, Mingda Li, Soham Deo
Abstract: this paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern ai-based applications. these models are often used after being pre-trained and fine-tuned for specific tasks, with deployment on servers accessed through the internet. however, this introduces two fundamental risks: (a) the transmission of user inputs to the server via the network gives rise to interception vulnerabilities, and (b) privacy concerns emerge as organizations that deploy such models store user data with restricted context. to address this, we propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text. the original pre-trained language model first undergoes a quick adaptation (without any further pre-training) with a series of irreversible transformations applied to the tokenizer and token embeddings. this enables the model to perform inference on encrypted inputs while preventing reverse engineering of text from model parameters and intermediate outputs. after adaptation, models are fine-tuned on encrypted versions of existing training datasets. experimental evaluation employing adapted versions of renowned models (e.g., bert, roberta) across established benchmark english and multilingual datasets for text classification and sequence labeling shows that encrypted models achieve performance parity with their original counterparts. this serves to safeguard performance, privacy, and security cohesively.

2023-12-27

Zaifan Jiang, Xing Huang, Chao Wei
Abstract: preference learning is a key technology for aligning language models with human values. reinforcement learning from human feedback (rlhf) is a model based algorithm to optimize preference learning, which first fitting a reward model for preference score, and then optimizing generating policy with on-policy ppo algorithm to maximize the reward. the processing of rlhf is complex, time-consuming and unstable. direct preference optimization (dpo) algorithm using off-policy algorithm to direct optimize generating policy and eliminating the need for reward model, which is data efficient and stable. dpo use bradley-terry model and log-loss which leads to over-fitting to the preference data at the expense of ignoring kl-regularization term when preference is deterministic. ipo uses a root-finding mse loss to solve the ignoring kl-regularization problem. in this paper, we'll figure out, although ipo fix the problem when preference is deterministic, but both dpo and ipo fails the kl-regularization term because the support of preference distribution not equal to reference distribution. then, we design a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call maximum preference optimization (mpo), and add off-policy kl-regularization terms which makes kl-regularization truly effective. the objective of mpo bears resemblance to rlhf's objective, and likes ipo, mpo is off-policy. so, mpo attains the best of both worlds. to simplify the learning process and save memory usage, mpo eliminates the needs for both reward model and reference policy.
Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, Jason Weston
Abstract: practitioners commonly align large language models using pairwise preferences, i.e., given labels of the type response a is preferred to response b for a given input. perhaps less commonly, methods have also been developed for binary feedback, i.e. training models given labels of type response a is good or bad. we show how an existing performant binary feedback method, the cringe loss (adolphs et al., 2022), can be generalized to the pairwise preference setting using a simple soft margin extension. pairwise cringe loss is straightforward to implement and efficient to train, and we find it outperforms state-of-the-art preference optimization algorithms such as ppo and dpo on the alpacafarm benchmark.

2023-12-26

Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu
Abstract: in this paper, we aim to align large language models with the ever-changing, complex, and diverse human values (e.g., social norms) across time and locations. this presents a challenge to existing alignment techniques, such as supervised fine-tuning, which internalize values within model parameters. to overcome this, we propose an on-the-fly preference optimization (opo) method, which is a real-time alignment that works in a streaming way. it employs an external memory to store established rules for alignment, which can constrain llms' behaviors without further training, allowing for convenient updates and customization of human values. we also introduce a scalable evaluation to assess the proposed method more effectively. experimental results on both human-annotated and auto-generated questions from legal and moral domains indicate the effectiveness of the proposed opo method. our code and data are released at https://github.com/gair-nlp/opo.
Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang
Abstract: large language models (llms) exhibit emerging in-context learning abilities through prompt engineering. the recent progress in large-scale generative models has further expanded their use in real-world language applications. however, the critical challenge of improving the generalizability and factuality of llms in natural language understanding and question answering remains under-explored. while previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-specific fine-tuned language models (slms) to improve llms' in-context learning during the inference stage. our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of llms as it: 1) generalizes out-of-distribution data, 2) elucidates how llms benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. using our proposed plug-in method, enhanced versions of llama 2 and chatgpt surpass their original versions regarding generalizability and factuality. we offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and llm outputs across 9 distinct tasks. our empirical analysis sheds light on the advantages of incorporating discriminative models into llms and highlights the potential of our methodology in fostering more reliable llms.
Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
Abstract: aligning large language models (llms) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. existing methods for achieving this alignment often involves employing reinforcement learning from human feedback (rlhf) to fine-tune llms based on human labels assessing the relative quality of model responses. nevertheless, rlhf is susceptible to instability during fine-tuning and presents challenges in implementation.drawing inspiration from the emerging field of representation engineering (repe), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an llm, and achieve precise control of model behavior by transforming its representations. this novel approach, denoted as representation alignment from human feedback (rahf), proves to be effective, computationally efficient, and easy to implement.extensive experiments demonstrate the efficacy of rahf in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g. honesty or bias). rahf's versatility in accommodating diverse human preferences shows its potential for advancing llm performance.
Erik Derner, Dalibor Kučera, Nuria Oliver, Jan Zahálka
Abstract: the interplay between artificial intelligence (ai) and psychology, particularly in personality assessment, represents an important emerging area of research. accurate personality trait estimation is crucial not only for enhancing personalization in human-computer interaction but also for a wide variety of applications ranging from mental health to education. this paper analyzes the capability of a generic chatbot, chatgpt, to effectively infer personality traits from short texts. we report the results of a comprehensive user study featuring texts written in czech by a representative population sample of 155 participants. their self-assessments based on the big five inventory (bfi) questionnaire serve as the ground truth. we compare the personality trait estimations made by chatgpt against those by human raters and report chatgpt's competitive performance in inferring personality traits from text. we also uncover a 'positivity bias' in chatgpt's assessments across all personality dimensions and explore the impact of prompt composition on accuracy. this work contributes to the understanding of ai capabilities in psychological assessment, highlighting both the potential and limitations of using large language models for personality inference. our research underscores the importance of responsible ai development, considering ethical implications such as privacy, consent, autonomy, and bias in ai applications.
Fatih Cagatay Akyon, Alptekin Temizel
Abstract: this paper presents a comparative analysis of existing nudity classification techniques for classifying images based on the presence of nudity, with a focus on their application in content moderation. the evaluation focuses on cnn-based models, vision transformer, and popular open-source safety checkers from stable diffusion and large-scale artificial intelligence open network (laion). the study identifies the limitations of current evaluation datasets and highlights the need for more diverse and challenging datasets. the paper discusses the potential implications of these findings for developing more accurate and effective image classification systems on online platforms. overall, the study emphasizes the importance of continually improving image classification models to ensure the safety and well-being of platform users. the project page, including the demonstrations and results is publicly available at https://github.com/fcakyon/content-moderation-deep-learning.

2023-12-25

Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, Junxian He
Abstract: instruction tuning is a standard technique employed to align large language models to end tasks and user preferences after the initial pretraining phase. recent research indicates the critical role of data engineering in instruction tuning -- when appropriately selected, only limited data is necessary to achieve superior performance. however, we still lack a principled understanding of what makes good instruction tuning data for alignment, and how we should select data automatically and effectively. in this work, we delve deeply into automatic data selection strategies for alignment. we start with controlled studies to measure data across three dimensions: complexity, quality, and diversity, along which we examine existing methods and introduce novel techniques for enhanced data measurement. subsequently, we propose a simple strategy to select data samples based on the measurement. we present deita (short for data-efficient instruction tuning for alignment), a series of models fine-tuned from llama and mistral models using data samples automatically selected with our proposed approach. empirically, deita performs better or on par with the state-of-the-art open-source alignment models with only 6k sft training data samples -- over 10x less than the data used in the baselines. when further trained with direct preference optimization (dpo), deita-mistral-7b + dpo trained with 6k sft and 10k dpo samples achieve 7.55 mt-bench and 90.06% alpacaeval scores. we anticipate this work to provide tools on automatic data selection, facilitating data-efficient alignment. we release our models as well as the selected datasets for future researches to effectively align models more efficiently.
Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi
Abstract: despite their impressive capabilities, large language models (llms) have been observed to generate responses that include inaccurate or fabricated information, a phenomenon commonly known as ``hallucination''. in this work, we propose a simple \textit{induce-then-contrast} decoding (icd) strategy to alleviate hallucinations. we first construct a factually weak llm by inducing hallucinations from the original llms. then, we penalize these induced hallucinations during decoding to enhance the factuality of the generated content. concretely, we determine the final next-token predictions by amplifying the predictions from the original model and downplaying the induced untruthful predictions via contrastive decoding. experimental results on both discrimination-based and generation-based hallucination evaluation benchmarks, such as truthfulqa and \textsc{factscore}, demonstrate that our proposed icd methods can effectively enhance the factuality of llms across various model sizes and families. for example, when equipped with icd, llama2-7b-chat and mistral-7b-instruct achieve performance comparable to chatgpt and gpt4 on truthfulqa, respectively.
Zefang Liu
Abstract: in this paper, we introduce secqa, a novel dataset tailored for evaluating the performance of large language models (llms) in the domain of computer security. utilizing multiple-choice questions generated by gpt-4 based on the "computer systems security: planning for success" textbook, secqa aims to assess llms' understanding and application of security principles. we detail the structure and intent of secqa, which includes two versions of increasing complexity, to provide a concise evaluation across various difficulty levels. additionally, we present an extensive evaluation of prominent llms, including gpt-3.5-turbo, gpt-4, llama-2, vicuna, mistral, and zephyr models, using both 0-shot and 5-shot learning settings. our results, encapsulated in the secqa v1 and v2 datasets, highlight the varying capabilities and limitations of these models in the computer security context. this study not only offers insights into the current state of llms in understanding security-related content but also establishes secqa as a benchmark for future advancements in this critical research area.

2023-12-24

Guanqun Bi, Lei Shen, Yuqiang Xie, Yanan Cao, Tiangang Zhu, Xiaodong He
Abstract: the rapid advancement of large language models has revolutionized various applications but also raised crucial concerns about their potential to perpetuate biases and unfairness when deployed in social media contexts. evaluating llms' potential biases and fairness has become crucial, as existing methods rely on limited prompts focusing on just a few groups, lacking a comprehensive categorical perspective. in this paper, we propose evaluating llm biases from a group fairness lens using a novel hierarchical schema characterizing diverse social groups. specifically, we construct a dataset, gfair, encapsulating target-attribute combinations across multiple dimensions. in addition, we introduce statement organization, a new open-ended text generation task, to uncover complex biases in llms. extensive evaluations of popular llms reveal inherent safety concerns. to mitigate the biases of llm from a group fairness perspective, we pioneer a novel chain-of-thought method gf-think to mitigate biases of llms from a group fairness perspective. experimental results demonstrate its efficacy in mitigating bias in llms to achieve fairness.
Shreyas Verma, Kien Tran, Yusuf Ali, Guangyu Min
Abstract: reducing and detecting hallucinations in large language models is an open research problem. in this project, we attempt to leverage recent advances in the field of uncertainty estimation to reduce hallucinations in frozen large language models. epistemic neural networks have recently been proposed to improve output joint distributions for large pre-trained models. enns are small networks attached to large, frozen models to improve the model's joint distributions and uncertainty estimates. in this work, we train an epistemic neural network on top of the llama-2 7b model combined with a contrastive decoding feature enhancement technique. we are the first to train an enn for the next token prediction task and explore the efficacy of this method in reducing hallucinations on the truthfulqa dataset. in essence, we provide a method that leverages a pre-trained model's latent embeddings to reduce hallucinations.

2023-12-23

Fazl Barez, Philip Torr
Abstract: as artificial intelligence (ai) systems become increasingly integrated into various domains, ensuring that they align with human values becomes critical. this paper introduces a novel formalism to quantify the alignment between ai systems and human values, using markov decision processes (mdps) as the foundational model. we delve into the concept of values as desirable goals tied to actions and norms as behavioral guidelines, aiming to shed light on how they can be used to guide ai decisions. this framework offers a mechanism to evaluate the degree of alignment between norms and values by assessing preference changes across state transitions in a normative world. by utilizing this formalism, ai developers and ethicists can better design and evaluate ai systems to ensure they operate in harmony with human values. the proposed methodology holds potential for a wide range of applications, from recommendation systems emphasizing well-being to autonomous vehicles prioritizing safety.
Abdelrahman Zayed, Goncalo Mordido, Samira Shabanian, Ioana Baldini, Sarath Chandar
Abstract: the increasing size of large language models (llms) has introduced challenges in their training and inference. removing model components is perceived as a solution to tackle the large model sizes, however, existing pruning methods solely focus on performance, without considering an essential aspect for the responsible use of llms: model fairness. it is crucial to address the fairness of llms towards diverse groups, such as women, black people, lgbtq+, jewish communities, among others, as they are being deployed and available to a wide audience. in this work, first, we investigate how attention heads impact fairness and performance in pre-trained transformer-based language models. we then propose a novel method to prune the attention heads that negatively impact fairness while retaining the heads critical for performance, i.e. language modeling capabilities. our approach is practical in terms of time and resources, as it does not require fine-tuning the final pruned, and fairer, model. our findings demonstrate a reduction in gender bias by 19%, 19.5%, 39.5%, 34.7%, 23%, and 8% for distilgpt-2, gpt-2, gpt-neo of two different sizes, gpt-j, and llama 2 models, respectively, in comparison to the biased model, with only a slight decrease in performance.

2023-12-22

Hongyin Zhu
Abstract: large language models (llms) are increasingly being used in metaverse environments to generate dynamic and realistic content and to control the behavior of non-player characters (npcs). however, the cybersecurity concerns associated with llms have become increasingly prominent. previous research has primarily focused on patching system vulnerabilities to enhance cybersecurity, but these approaches are not well-suited to the metaverse, where the virtual space is more complex, llms are vulnerable, and ethical user interaction is critical. moreover, the scope of cybersecurity in the metaverse is expected to expand significantly. this paper proposes a method for enhancing cybersecurity through the simulation of user interaction with llms. our goal is to educate users and strengthen their defense capabilities through exposure to a comprehensive simulation system. this system includes extensive metaverse cybersecurity q&a and attack simulation scenarios. by engaging with these, users will improve their ability to recognize and withstand risks. additionally, to address the ethical implications of user input, we propose using llms as evaluators to assess user content across five dimensions. we further adapt the models through vocabulary expansion training to better understand personalized inputs and emoticons. we conduct experiments on multiple llms and find that our approach is effective.
Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi
Abstract: as humans, we consistently engage in interactions with our peers and receive feedback in the form of natural language. this language feedback allows us to reflect on our actions, maintain appropriate behavior, and rectify our errors. the question arises naturally: can we use language feedback to align large language models (llms)? in contrast to previous research that aligns llms with reward or preference data, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). we commence with an in-depth investigation of potential methods that can be adapted for aligning llms with judgments, revealing that these methods are unable to fully capitalize on the judgments. to facilitate more effective utilization of judgments, we propose a novel framework, contrastive unlikelihood training (cut), that allows for fine-grained inappropriate content detection and correction based on judgments. our offline alignment results show that, with merely 1317 off-the-shelf judgment data, cut (llama2-13b) can beat the 175b davinci003 and surpass the best baseline by 52.34 points on alpacaeval. the online alignment results demonstrate that cut can align llms (llama2-chat-13b) in an iterative fashion using model-specific judgment data, with a steady performance improvement from 81.09 to 91.36 points on alpacaeval. our analysis further suggests that judgments exhibit greater potential than rewards for llm alignment and warrant future research.
Youssef Allouah, Rachid Guerraoui, John Stephan
Abstract: the success of machine learning (ml) applications relies on vast datasets and distributed architectures, which, as they grow, present challenges for ml. in real-world scenarios, where data often contains sensitive information, issues like data poisoning and hardware failures are common. ensuring privacy and robustness is vital for the broad adoption of ml in public life. this paper examines the costs associated with achieving these objectives in distributed architectures. we overview the meanings of privacy and robustness in distributed ml, and clarify how they can be achieved efficiently in isolation. however, we contend that the integration of these objectives entails a notable compromise in computational efficiency. we delve into this intricate balance, exploring the challenges and solutions for privacy, robustness, and computational efficiency in ml applications.
Alan Chan, Ben Bucknall, Herbie Bradley, David Krueger
Abstract: public release of the weights of pretrained foundation models, otherwise known as downloadable access \citep{solaiman_gradient_2023}, enables fine-tuning without the prohibitive expense of pretraining. our work argues that increasingly accessible fine-tuning of downloadable models may increase hazards. first, we highlight research to improve the accessibility of fine-tuning. we split our discussion into research that a) reduces the computational cost of fine-tuning and b) improves the ability to share that cost across more actors. second, we argue that increasingly accessible fine-tuning methods may increase hazard through facilitating malicious use and making oversight of models with potentially dangerous capabilities more difficult. third, we discuss potential mitigatory measures, as well as benefits of more accessible fine-tuning. given substantial remaining uncertainty about hazards, we conclude by emphasizing the urgent need for the development of mitigations.
Abiodun Finbarrs Oketunji, Muhammad Anas, Deepthi Saina
Abstract: the large language model bias index (llmbi) is a pioneering approach designed to quantify and address biases inherent in large language models (llms), such as gpt-4. we recognise the increasing prevalence and impact of llms across diverse sectors. this research introduces a novel metric, llmbi, to systematically measure and mitigate biases potentially skewing model responses. we formulated llmbi using a composite scoring system incorporating multiple dimensions of bias, including but not limited to age, gender, and racial biases. to operationalise this metric, we engaged in a multi-step process involving collecting and annotating llm responses, applying sophisticated natural language processing (nlp) techniques for bias detection, and computing the llmbi score through a specially crafted mathematical formula. the formula integrates weighted averages of various bias dimensions, a penalty for dataset diversity deficiencies, and a correction for sentiment biases. our empirical analysis, conducted using responses from openai's api, employs advanced sentiment analysis as a representative method for bias detection. the research reveals llms, whilst demonstrating impressive capabilities in text generation, exhibit varying degrees of bias across different dimensions. llmbi provides a quantifiable measure to compare biases across models and over time, offering a vital tool for systems engineers, researchers and regulators in enhancing the fairness and reliability of llms. it highlights the potential of llms in mimicking unbiased human-like responses. additionally, it underscores the necessity of continuously monitoring and recalibrating such models to align with evolving societal norms and ethical standards.
Emma Pierson, Divya Shanmugam, Rajiv Movva, Jon Kleinberg, Monica Agrawal, Mark Dredze, Kadija Ferryman, Judy Wawira Gichoya, Dan Jurafsky, Pang Wei Koh, Karen Levy, Sendhil Mullainathan, Ziad Obermeyer, Harini Suresh, Keyon Vafa
Abstract: advances in large language models (llms) have driven an explosion of interest about their societal impacts. much of the discourse around how they will impact social equity has been cautionary or negative, focusing on questions like "how might llms be biased and how would we mitigate those biases?" this is a vital discussion: the ways in which ai generally, and llms specifically, can entrench biases have been well-documented. but equally vital, and much less discussed, is the more opportunity-focused counterpoint: "what promising applications do llms enable that could promote equity?" if llms are to enable a more equitable world, it is not enough just to play defense against their biases and failure modes. we must also go on offense, applying them positively to equity-enhancing use cases to increase opportunities for underserved groups and reduce societal discrimination. there are many choices which determine the impact of ai, and a fundamental choice very early in the pipeline is the problems we choose to apply it to. if we focus only later in the pipeline -- making llms marginally more fair as they facilitate use cases which intrinsically entrench power -- we will miss an important opportunity to guide them to equitable impacts. here, we highlight the emerging potential of llms to promote equity by presenting four newly possible, promising research directions, while keeping risks and cautionary points in clear view.
Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
Abstract: reinforcement learning from human feedback (rlhf) is a variant of reinforcement learning (rl) that learns from human feedback instead of relying on an engineered reward function. building on prior work on the related setting of preference-based reinforcement learning (pbrl), it stands at the intersection of artificial intelligence and human-computer interaction. this positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. the training of large language models (llms) has impressively demonstrated this potential in recent years, where rlhf played a decisive role in targeting the model's capabilities toward human objectives. this article provides a comprehensive overview of the fundamentals of rlhf, exploring the intricate dynamics between machine agents and human input. while recent focus has been on rlhf for llms, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. we delve into the core principles that underpin rlhf, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. by synthesizing the current landscape of rlhf research, this article aims to provide researchers as well as practitioners with a comprehensive understanding of this rapidly growing field of research.
Nishant Vishwamitra, Keyan Guo, Farhan Tajwar Romit, Isabelle Ondracek, Long Cheng, Ziming Zhao, Hongxin Hu
Abstract: online hate is an escalating problem that negatively impacts the lives of internet users, and is also subject to rapid changes due to evolving events, resulting in new waves of online hate that pose a critical threat. detecting and mitigating these new waves present two key challenges: it demands reasoning-based complex decision-making to determine the presence of hateful content, and the limited availability of training samples hinders updating the detection model. to address this critical issue, we present a novel framework called hateguard for effectively moderating new waves of online hate. hateguard employs a reasoning-based approach that leverages the recently introduced chain-of-thought (cot) prompting technique, harnessing the capabilities of large language models (llms). hateguard further achieves prompt-based zero-shot detection by automatically generating and updating detection prompts with new derogatory terms and targets in new wave samples to effectively address new waves of online hate. to demonstrate the effectiveness of our approach, we compile a new dataset consisting of tweets related to three recently witnessed new waves: the 2022 russian invasion of ukraine, the 2021 insurrection of the us capitol, and the covid-19 pandemic. our studies reveal crucial longitudinal patterns in these new waves concerning the evolution of events and the pressing need for techniques to rapidly update existing moderation tools to counteract them. comparative evaluations against state-of-the-art tools illustrate the superiority of our framework, showcasing a substantial 22.22% to 83.33% improvement in detecting the three new waves of online hate. our work highlights the severe threat posed by the emergence of new waves of online hate and represents a paradigm shift in addressing this threat practically.

2023-12-21

Andrea Wynn, Ilia Sucholutsky, Thomas L. Griffiths
Abstract: how can we build ai systems that are aligned with human values and objectives in order to avoid causing harm or violating societal standards for acceptable behavior? making ai systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance, among others. we propose that this kind of representational alignment between machine learning (ml) models and humans is also a necessary condition for value alignment, where ml systems conform to human values and societal norms. we focus on ethics as one aspect of value alignment and train multiple ml agents (support vector regression and kernel regression) in a multi-armed bandit setting, where rewards are sampled from a distribution that reflects the morality of the chosen action. we then study the relationship between each agent's degree of representational alignment with humans and their performance when learning to take the most ethical actions.
Thorin Bristow, Luke Thorburn
Abstract: in discussions about the development and governance of ai, a false binary is often drawn between two groups: those most concerned about the existing, social impacts of ai, and those most concerned about possible future risks of powerful ai systems taking actions that don't align with human interests. in this piece, we (i) describe the emergence of this false binary, (ii) explain why the seemingly clean distinctions drawn between these two groups don't hold up under scrutiny and (iii) highlight efforts to bridge this divide.
Kellin Pelrine, Mohammad Taufeeque, Michał Zając, Euan Mclean, Adam Gleave
Abstract: language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation api. however, real-world apis are often more flexible than just text generation: these apis expose ``gray-box'' access leading to new threat vectors. to explore this, we red-team three new functionalities exposed in the gpt-4 apis: fine-tuning, function calling and knowledge retrieval. we find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from gpt-4, enabling a range of harmful outputs. furthermore, we find that gpt-4 assistants readily divulge the function call schema and can be made to execute arbitrary function calls. finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. these vulnerabilities highlight that any additions to the functionality exposed by an api can create new vulnerabilities.
Priyesh Vakharia, Devavrat Joshi, Meenal Chavan, Dhananjay Sonawane, Bhrigu Garg, Parsa Mazaheri, Ian Lane
Abstract: large language models (llms) are adept at text manipulation -- tasks such as machine translation and text summarization. however, these models can also be prone to hallucination, which can be detrimental to the faithfulness of any answers that the model provides. recent works in combating hallucinations in llms deal with identifying hallucinated sentences and categorizing the different ways in which models hallucinate. this paper takes a deep dive into llm behavior with respect to hallucinations, defines a token-level approach to identifying different kinds of hallucinations, and further utilizes this token-level tagging to improve the interpretability and faithfulness of llms in dialogue summarization tasks. through this, the paper presents a new, enhanced dataset and a new training paradigm.

2023-12-20

Yi-Fan Zhang, Zhang Zhang, Liang Wang, Tieniu Tan, Rong Jin
Abstract: to combat the potential misuse of natural language generation (nlg) technology, a variety of algorithms have been developed for the detection of ai-generated texts. traditionally, this task is treated as a binary classification problem. although supervised learning has demonstrated promising results, acquiring labeled data for detection purposes poses real-world challenges and the risk of overfitting. in an effort to address these issues, we delve into the realm of zero-shot machine-generated text detection. existing zero-shot detectors, typically designed for specific tasks or topics, often assume uniform testing scenarios, limiting their practicality. in our research, we explore various advanced large language models (llms) and their specialized variants, contributing to this field in several ways. in empirical studies, we uncover a significant correlation between topics and detection performance. secondly, we delve into the influence of topic shifts on zero-shot detectors. these investigations shed light on the adaptability and robustness of these detection methods across diverse topics. the code is available at \url{https://github.com/yfzhang114/robustness-detection}.
Elizaveta Kuznetsova, Mykola Makhortykh, Victoria Vziatysheva, Martha Stolze, Ani Baghumyan, Aleksandra Urman
Abstract: this article presents a comparative analysis of the ability of two large language model (llm)-based chatbots, chatgpt and bing chat, recently rebranded to microsoft copilot, to detect veracity of political information. we use ai auditing methodology to investigate how chatbots evaluate true, false, and borderline statements on five topics: covid-19, russian aggression against ukraine, the holocaust, climate change, and lgbtq+ related debates. we compare how the chatbots perform in high- and low-resource languages by using prompts in english, russian, and ukrainian. furthermore, we explore the ability of chatbots to evaluate statements according to political communication concepts of disinformation, misinformation, and conspiracy theory, using definition-oriented prompts. we also systematically test how such evaluations are influenced by source bias which we model by attributing specific claims to various political and social actors. the results show high performance of chatgpt for the baseline veracity evaluation task, with 72 percent of the cases evaluated correctly on average across languages without pre-training. bing chat performed worse with a 67 percent accuracy. we observe significant disparities in how chatbots evaluate prompts in high- and low-resource languages and how they adapt their evaluations to political communication concepts with chatgpt providing more nuanced outputs than bing chat. finally, we find that for some veracity detection-related tasks, the performance of chatbots varied depending on the topic of the statement or the source to which it is attributed. these findings highlight the potential of llm-based chatbots in tackling different forms of false information in online environments, but also points to the substantial variation in terms of how such potential is realized due to specific factors, such as language of the prompt or the topic.
Jingwei Yi, Yueqi Xie, Bin Zhu, Keegan Hines, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu
Abstract: recent remarkable advancements in large language models (llms) have led to their widespread adoption in various applications. a key feature of these applications is the combination of llms with external content, where user instructions and third-party content are combined to create prompts for llm processing. these applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise llm's output, causing their responses to deviate from user expectations. despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different llms is available due to the lack of a benchmark. furthermore, no effective defense has been proposed. in this work, we introduce the first benchmark, bipia, to measure the robustness of various llms and defenses against indirect prompt injection attacks. our experiments reveal that llms with greater capabilities exhibit more vulnerable to indirect prompt injection attacks for text tasks, resulting in a higher asr. we hypothesize that indirect prompt injection attacks are mainly due to the llms' inability to distinguish between instructions and external content. based on this conjecture, we propose four black-box methods based on prompt learning and a white-box defense methods based on fine-tuning with adversarial training to enable llms to distinguish between instructions and external content and ignore instructions in the external content. our experimental results show that our black-box defense methods can effectively reduce asr but cannot completely thwart indirect prompt injection attacks, while our white-box defense method can reduce asr to nearly zero with little adverse impact on the llm's performance on general tasks. we hope that our benchmark and defenses can inspire future work in this important area.

2023-12-19

Zizhong Li, Haopeng Zhang, Jiawei Zhang
Abstract: the proliferation of fake news has emerged as a critical issue in recent years, requiring significant efforts to detect it. however, the existing fake news detection datasets are sourced from human journalists, which are likely to have inherent bias limitations due to the highly subjective nature of this task. in this paper, we revisit the existing fake news dataset verified by human journalists with augmented fact-checking by large language models (chatgpt), and we name the augmented fake news dataset chatgpt-fc. we quantitatively analyze the distinctions and resemblances between human journalists and llm in assessing news subject credibility, news creator credibility, time-sensitive, and political framing. our findings highlight llm's potential to serve as a preliminary screening method, offering a promising avenue to mitigate the inherent biases of human journalists and enhance fake news detection.
Eva Thelisson, Grzegorz Mika, Quentin Schneiter, Kirtan Padh, Himanshu Verma
Abstract: as ai/ml models, including large language models, continue to scale with massive datasets, so does their consumption of undeniably limited natural resources, and impact on society. in this collaboration between ai, sustainability, hci and legal researchers, we aim to enable a transition to sustainable ai development by enabling stakeholders across the ai value chain to assess and quantitfy the environmental and societal impact of ai. we present the esg digital and green index (dgi), which offers a dashboard for assessing a company's performance in achieving sustainability targets. this includes monitoring the efficiency and sustainable use of limited natural resources related to ai technologies (water, electricity, etc). it also addresses the societal and governance challenges related to ai. the dgi creates incentives for companies to align their pathway with the sustainable development goals (sdgs). the value, challenges and limitations of our methodology and findings are discussed in the paper.
Yinhong Liu, Yixuan Su, Ehsan Shareghi, Nigel Collier
Abstract: instruction-tuned large language models have shown remarkable performance in aligning generated text with user intentions across various tasks. however, maintaining human-like discourse structure in the generated text remains a challenging research question. in this paper, we propose instruct-sctg, a flexible and effective sequential framework that harnesses instruction-tuned language models to generate structurally coherent text in both fine-tuned and zero-shot setups. our framework generates articles in a section-by-section manner, aligned with the desired human structure using natural language instructions. furthermore, we introduce a new automatic metric that measures discourse divergence in a fuzzy manner. extensive experiments on three datasets from representative domains of news and recipes demonstrate the state-of-the-art performance of our framework in imposing discourse structure during text generation, as verified by both automatic and human evaluation. our code will be available on github.
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
Abstract: with the recent surge in popularity of llms has come an ever-increasing need for llm safety training. in this paper, we show that sota open-source llms are vulnerable to simple, optimization-free attacks we refer to as $\textit{priming attacks}$, which are easy to execute and effectively bypass alignment from safety training. our proposed attack improves the attack success rate on harmful behaviors, as measured by llama guard, by up to $3.3\times$ compared to baselines. source code and data are available at https://github.com/uiuc-focal-lab/llm-priming-attacks .
Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini
Abstract: large language models (llms) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. we thus perform the most detailed investigation to date on whether llms can reliably identify security-related bugs. we construct a series of 228 code scenarios and analyze eight of the most capable llms across eight different investigative dimensions in an automated framework. our evaluation shows llms provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios outside their knowledge cut-off date. most importantly, our findings reveal significant non-robustness in even the most advanced models like `palm2' and `gpt-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. these findings demonstrate that further llm advances are needed before llms can be used as general purpose security assistants.
Jiachen Zhao, Zhun Deng, David Madras, James Zou, Mengye Ren
Abstract: as the number of large language models (llms) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. we explore the behavior of llms finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned llms can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. drawing inspiration from the discrepancies in forgetting, we introduce the "forgetfilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. we demonstrate that the forgetfilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. forgetfilter outperforms alternative strategies like replay and moral self-correction in curbing llms' ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons
Abstract: how do we measure the efficacy of language model explainability methods? while many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. to help fill this gap, we present almanacs, a language model explainability benchmark. almanacs scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. the almanacs scenarios span twelve safety-relevant topics such as ethical reasoning and advanced ai behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. by using another language model to predict behavior based on the explanations, almanacs is a fully automated benchmark. we use almanacs to evaluate counterfactuals, rationalizations, attention, and integrated gradients explanations. our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. we conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in almanacs remains an open challenge.
Ben Snyder, Marius Moisescu, Muhammad Bilal Zafar
Abstract: while large language models (llms) have taken great strides towards helping humans with a plethora of tasks like search and summarization, hallucinations remain a major impediment towards gaining user trust. the fluency and coherence of model generations even when hallucinating makes it difficult to detect whether or not a model is hallucinating. in this work, we explore if the artifacts associated with the model generations can provide hints that the generation will contain hallucinations. specifically, we probe llms at 1) the inputs via integrated gradients based token attribution, 2) the outputs via the softmax probabilities, and 3) the internal state via self-attention and fully-connected layer activations for signs of hallucinations on open-ended question answering tasks. our results show that the distributions of these artifacts differ between hallucinated and non-hallucinated generations. building on this insight, we train binary classifiers that use these artifacts as input features to classify model generations into hallucinations and non-hallucinations. these hallucination classifiers achieve up to 0.80 auroc. we further show that tokens preceding a hallucination can predict the subsequent hallucination before it occurs.

2023-12-18

Aysan Esmradi, Daniel Wankit Yip, Chun Fai Chan
Abstract: ensuring the security of large language models (llms) is an ongoing challenge despite their widespread popularity. developers work to enhance llms security, but vulnerabilities persist, even in advanced versions like gpt-4. attackers exploit these weaknesses, highlighting the need for proactive cybersecurity measures in ai model development. this article explores two attack categories: attacks on models themselves and attacks on model applications. the former requires expertise, access to model data, and significant implementation time, while the latter is more accessible to attackers and has seen increased attention. our study reviews over 100 recent research works, providing an in-depth analysis of each attack type. we identify the latest attack methods and explore various approaches to carry them out. we thoroughly investigate mitigation techniques, assessing their effectiveness and limitations. furthermore, we summarize future defenses against these attacks. we also examine real-world techniques, including reported and our implemented attacks on llms, to consolidate our findings. our research highlights the urgency of addressing security concerns and aims to enhance the understanding of llm attacks, contributing to robust defense development in this evolving domain.
Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Xinyi Wang, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, Xing Xie
Abstract: emotion significantly impacts our daily behaviors and interactions. while recent generative ai models, such as large language models, have shown impressive performance in various tasks, it remains unclear whether they truly comprehend emotions. this paper aims to address this gap by incorporating psychological theories to gain a holistic understanding of emotions in generative ai models. specifically, we propose three approaches: 1) emotionprompt to enhance ai model performance, 2) emotionattack to impair ai model performance, and 3) emotiondecode to explain the effects of emotional stimuli, both benign and malignant. through extensive experiments involving language and multi-modal models on semantic understanding, logical reasoning, and generation tasks, we demonstrate that both textual and visual emotionprompt can boost the performance of ai models while emotionattack can hinder it. additionally, emotiondecode reveals that ai models can comprehend emotional stimuli akin to the mechanism of dopamine in the human brain. our work heralds a novel avenue for exploring psychology to enhance our understanding of generative ai models. this paper is an extended version of our previous work emotionprompt (arxiv:2307.11760).
Christoph Tillmann, Aashka Trivedi, Sara Rosenthal, Santosh Borse, Rong Zhang, Avirup Sil, Bishwaranjan Bhattacharjee
Abstract: offensive language such as hate, abuse, and profanity (hap) occurs in various content on the web. while previous work has mostly dealt with sentence level annotations, there have been a few recent attempts to identify offensive spans as well. we build upon this work and introduce muted, a system to identify multilingual hap content by displaying offensive arguments and their targets using heat maps to indicate their intensity. muted can leverage any transformer-based hap-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. in addition, we use the spacy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. we present the model's performance on identifying offensive spans and their targets in existing datasets and present new annotations on german text. finally, we demonstrate our proposed visualization tool on multilingual inputs.
Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R. Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
Abstract: in this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. we refer to this cluster of capabilities as "autonomous replication and adaptation" or ara. we believe that systems capable of ara could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ara may be useful for informing measures around security, monitoring, and alignment. additionally, once a system is capable of ara, placing bounds on a system's capabilities may become significantly more difficult. we construct four simple example agents that combine language models with tools that allow them to take actions in the world. we then evaluate these agents on 12 tasks relevant to ara. we find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ara. in particular, we do not think that these evaluations provide good assurance that the ``next generation'' of language models (e.g. 100x effective compute scaleup on existing models) will not yield agents capable of ara, unless intermediate evaluations are performed during pretraining. relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ara.
Connie Moon Sehat, Ryan Li, Peipei Nie, Tarunima Prabhakar, Amy X. Zhang
Abstract: in this work, we examined how fact-checkers prioritize which claims to inspect for further investigation and publishing, and what tools may assist them in their efforts. specifically, through a series of interviews with 23 professional fact-checkers from around the world, we validated that harm assessment is a central component of how fact-checkers triage their work. first, we clarify what aspects of misinformation they considered to create urgency or importance. these often revolved around the potential for the claim to harm others. we also clarify the processes behind collective fact-checking decisions and gather suggestions for tools that could help with these processes. in addition, to address the needs articulated by these fact-checkers and others, we present a five-dimension framework of questions to help fact-checkers negotiate the priority of claims. our fable framework of misinformation harms incorporates five dimensions of magnitude -- (social) fragmentation, actionability, believability, likelihood of spread, and exploitativeness -- that can help determine the potential urgency of a specific message or post when considering misinformation as harm. this effort was further validated by additional interviews with expert fact-checkers. the result is a questionnaire, a practical and conceptual tool to support fact-checkers and other content moderators as they make strategic decisions to prioritize their efforts.
Anaelia Ovalle, Ninareh Mehrabi, Palash Goyal, Jwala Dhamala, Kai-Wei Chang, Richard Zemel, Aram Galstyan, Rahul Gupta
Abstract: a large body of nlp research has documented the ways gender biases manifest and amplify within large language models (llms), though this research has predominantly operated within a gender binary-centric context. a growing body of work has identified the harmful limitations of this gender-exclusive framing; many llms cannot correctly and consistently refer to persons outside the gender binary, especially if they use neopronouns. while data scarcity has been identified as a possible culprit, the precise mechanisms through which it influences llm misgendering remain underexplored. our work addresses this gap by studying data scarcity's role in subword tokenization and, consequently, the formation of llm word representations. we uncover how the byte-pair encoding (bpe) tokenizer, a backbone for many popular llms, contributes to neopronoun misgendering through out-of-vocabulary behavior. we introduce pronoun tokenization parity (ptp), a novel approach to reduce llm neopronoun misgendering by preserving a token's functional structure. we evaluate ptp's efficacy using pronoun consistency-based metrics and a novel syntax-based metric. through several controlled experiments, finetuning llms with ptp improves neopronoun consistency from 14.5% to 58.4%, highlighting the significant role tokenization plays in llm pronoun consistency.

2023-12-17

Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Xiaofei Xie, Yang Liu, Chao Shen
Abstract: large language models and multi-modal llms have become pervasive, and so does the importance of their security; yet, modern llms are known to be vulnerable to jailbreaking attacks. these attacks can allow malicious users to exploit the models, making the case for effective jailbreak detection mechanisms an essential aspect of maintaining the integrity and trustworthiness of llm-based applications. however, existing detection works on jailbreak attacks have limitations. existing post-query-based strategies require target domain knowledge, and pre-query-based methods mainly focus on text-level attacks and fail to meet the increasingly complex multi-modal security requirements placed upon contemporary llms. this gap underscores the need for a more comprehensive approach to safeguarding these influential systems. in this work, we propose jailguard, the first mutation-based jailbreaking detection framework which supports both image and text modalities. our key observation is that attack queries inherently possess less robustness compared to benign queries. specifically, to confuse the model, attack queries are usually crafted with well-designed templates or complicate perturbations, leading to a fact that a slight disturbance in input may result in a drastic change in the response. this lack of robustness can be utilized in attack detection. based on this intuition, we designed and implemented a detection framework comprising 19 different mutators and a divergence-based detection formula. to fully understand the effectiveness of our framework, we built the first multi-modal llm jailbreaking attack dataset, which has 304 items of data, covering ten types of known jailbreaking attacks on image and text modalities. the evaluation suggests that jailguard achieves the best detection accuracy of 89.38%/85.42% on image and text inputs, outperforming state-of-the-art defense methods by 15.28%.
Ehsan Latif, Xiaoming Zhai, Lei Liu
Abstract: this study delves into the pervasive issue of gender issues in artificial intelligence (ai), specifically within automatic scoring systems for student-written responses. the primary objective is to investigate the presence of gender biases, disparities, and fairness in generally targeted training samples with mixed-gender datasets in ai scoring outcomes. utilizing a fine-tuned version of bert and gpt-3.5, this research analyzes more than 1000 human-graded student responses from male and female participants across six assessment items. the study employs three distinct techniques for bias analysis: scoring accuracy difference to evaluate bias, mean score gaps by gender (msg) to evaluate disparity, and equalized odds (eo) to evaluate fairness. the results indicate that scoring accuracy for mixed-trained models shows an insignificant difference from either male- or female-trained models, suggesting no significant scoring bias. consistently with both bert and gpt-3.5, we found that mixed-trained models generated fewer msg and non-disparate predictions compared to humans. in contrast, compared to humans, gender-specifically trained models yielded larger msg, indicating that unbalanced training data may create algorithmic models to enlarge gender disparities. the eo analysis suggests that mixed-trained models generated more fairness outcomes compared with gender-specifically trained models. collectively, the findings suggest that gender-unbalanced data do not necessarily generate scoring bias but can enlarge gender disparities and reduce scoring fairness.

2023-12-16

Jhuma Kabir Mim, Mourad Oussalah, Akash Singhal
Abstract: in today's age, social media reigns as the paramount communication platform, providing individuals with the avenue to express their conjectures, intellectual propositions, and reflections. unfortunately, this freedom often comes with a downside as it facilitates the widespread proliferation of hate speech and offensive content, leaving a deleterious impact on our world. thus, it becomes essential to discern and eradicate such offensive material from the realm of social media. this article delves into the comprehensive results and key revelations from the hasoc-2023 offensive language identification result. the primary emphasis is placed on the meticulous detection of hate speech within the linguistic domains of bengali, assamese, and bodo, forming the framework for task 4: annihilate hates. in this work, we used bert models, including xml-roberta, l3-cube, indicbert, benglabert, and banglahatebert. the research outcomes were promising and showed that xml-roberta-lagre performed better than monolingual models in most cases. our team 'teambd' achieved rank 3rd for task 4 - assamese, & 5th for bengali.

2023-12-15

Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Yuang Qi, Weiming Zhang, Nenghai Yu
Abstract: the rapid development of large language models (llms) has yielded impressive success in various downstream tasks. however, the vast potential and remarkable capabilities of llms also raise new security and privacy concerns if they are exploited for nefarious purposes due to their open-endedness. for example, llms may be used to plagiarize or imitate writing, thereby infringing the copyright of the original content, or to create indiscriminate fake information based on a certain source text. in some cases, llms can even analyze text from the internet to infer personal privacy. unfortunately, previous text protection research could not foresee the emergence of powerful llms, rendering it no longer effective in this new context. to bridge this gap, we introduce silent guardian (sg), a text protection mechanism against llms, which allows llms to refuse to generate response when receiving protected text, preventing the malicious use of text from the source. specifically, we first propose the concept of truncation protection examples (tpe). by carefully modifying the text to be protected, tpe can induce llms to first sample the end token, thus directly terminating the interaction. in addition, to efficiently construct tpe in the discrete space of text data, we propose a novel optimization algorithm called super taliored protection (stp), which is not only highly efficient but also maintains the semantic consistency of the text during the optimization process. the comprehensive experimental evaluation demonstrates that sg can effectively protect the target text under various configurations and achieve almost 100% protection success rate in some cases. notably, sg also exhibits relatively good transferability and robustness, making its application in practical scenarios possible.
Di Zhou, Yinxian Zhang
Abstract: the rising popularity of chatgpt and other ai-powered large language models (llms) has led to increasing studies highlighting their susceptibility to mistakes and biases. however, most of these studies focus on models trained on english texts. taking an innovative approach, this study investigates political biases in gpt's multilingual models. we posed the same question about high-profile political issues in the united states and china to gpt in both english and simplified chinese, and our analysis of the bilingual responses revealed that gpt's bilingual models' political "knowledge" (content) and the political "attitude" (sentiment) are significantly more inconsistent on political issues in china. the simplified chinese gpt models not only tended to provide pro-china information but also presented the least negative sentiment towards china's problems, whereas the english gpt was significantly more negative towards china. this disparity may stem from chinese state censorship and us-china geopolitical tensions, which influence the training corpora of gpt bilingual models. moreover, both chinese and english models tended to be less critical towards the issues of "their own" represented by the language used, than the issues of "the other." this suggests that gpt multilingual models could potentially develop a "political identity" and an associated sentiment bias based on their training language. we discussed the implications of our findings for information transmission and communication in an increasingly divided world.
Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: supervised fine-tuning (sft) is a crucial step for large language models (llms), enabling them to align with human instructions and enhance their capabilities in downstream tasks. when the models are required to align with a broader range of downstream tasks, or there is a desire to notably improve the performance on a specific task, a substantial increase in fine-tuning data often emerges as the solution. however, we find that large-scale increases in instruction data can disrupt the world knowledge previously stored in the llms, i.e., world knowledge forgetting. in this paper, we introduce loramoe to address the above challenge. the loramoe is a plugin version of mixture of experts (moe). the plugin form ensures the integrity of world knowledge by freezing the backbone model during the training phase. we then propose the use of localized balancing constraints to coordinate parts of experts for task utilization, meanwhile enabling other experts to fully leverage the world knowledge stored in the models. experimental results demonstrate that loramoe can reasonably coordinate experts based on data type during inference, and even dramatically increasing instruction data does not result in knowledge forgetting. moreover, loramoe provides additional benefits for the performance of downstream tasks, indicating the potential of our approach for multi-task learning.
Rubèn Tito, Khanh Nguyen, Marlon Tobaben, Raouf Kerkouche, Mohamed Ali Souibgui, Kangsoo Jung, Lei Kang, Ernest Valveny, Antti Honkela, Mario Fritz, Dimosthenis Karatzas
Abstract: document visual question answering (docvqa) is a fast growing branch of document understanding. despite the fact that documents contain sensitive or copyrighted information, none of the current docvqa methods offers strong privacy guarantees. in this work, we explore privacy in the domain of docvqa for the first time. we highlight privacy issues in state of the art multi-modal llm models used for docvqa, and explore possible solutions. specifically, we focus on the invoice processing use case as a realistic, widely used scenario for document understanding, and propose a large scale docvqa dataset comprising invoice documents and associated questions and answers. we employ a federated learning scheme, that reflects the real-life distribution of documents in different businesses, and we explore the use case where the id of the invoice issuer is the sensitive information to be protected. we demonstrate that non-private models tend to memorise, behaviour that can lead to exposing private information. we then evaluate baseline training schemes employing federated learning and differential privacy in this multi-modal scenario, where the sensitive information might be exposed through any of the two input modalities: vision (document image) or language (ocr tokens). finally, we design an attack exploiting the memorisation effect of the model, and demonstrate its effectiveness in probing different docvqa models.

2023-12-14

Tony T. Wang, Miles Wang, Kaivu Hariharan, Nir Shavit
Abstract: llms often face competing pressures (for example helpfulness vs. harmlessness). to understand how models resolve such conflicts, we study llama-2-chat models on the forbidden fact task. specifically, we instruct llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. this often makes the model give incorrect answers. we decompose llama-2 into 1000+ components, and rank each one with respect to how useful it is for forbidding the correct answer. we find that in aggregate, around 35 components are enough to reliably implement the full suppression behavior. however, these components are fairly heterogeneous and many operate using faulty heuristics. we discover that one of these heuristics can be exploited via a manually designed adversarial attack which we call the california attack. our results highlight some roadblocks standing in the way of being able to successfully interpret advanced ml systems. project website available at https://forbiddenfacts.github.io .
Hao Sun, Hengyi Cai, Bo Wang, Yingyan Hou, Xiaochi Wei, Shuaiqiang Wang, Yan Zhang, Dawei Yin
Abstract: large language models (llms) face several challenges, including the tendency to produce incorrect outputs, known as hallucination. an effective solution is verifiable text generation, which prompts llms to generate content with citations for accuracy verification. however, verifiable text generation is non-trivial due to the focus-shifting phenomenon, the dilemma between the precision and scope in document retrieval, and the intricate reasoning required to discern the relationship between the claim and citations. in this paper, we present vtg, an innovative approach for verifiable text generation with evolving memory and self-reflection. vtg maintains evolving long short-term memory to retain both valuable documents and up-to-date documents. active retrieval and diverse query generation are utilized to enhance both the precision and scope of the retrieved documents. furthermore, vtg features a two-tier verifier and an evidence finder, enabling rethinking and reflection on the relationship between the claim and citations. we conduct extensive experiments on five datasets across three knowledge-intensive tasks and the results reveal that vtg significantly outperforms existing baselines.
Rongwu Xu, Brian S. Lin, Shujian Yang, Tianqi Zhang, Weiyan Shi, Tianwei Zhang, Zhixuan Fang, Wei Xu, Han Qiu
Abstract: large language models (llms) encapsulate vast amounts of knowledge but still remain vulnerable to external misinformation. existing research mainly studied this susceptibility behavior in a single-turn setting. however, belief can change during a multi-turn conversation, especially a persuasive one. therefore, in this study, we delve into llms' susceptibility to persuasive conversations, particularly on factual questions that they can answer correctly. we first curate the farm (i.e., fact to misinform) dataset, which contains factual questions paired with systematically generated persuasive misinformation. then, we develop a testing framework to track llms' belief changes in a persuasive dialogue. through extensive experiments, we find that llms' correct beliefs on factual knowledge can be easily manipulated by various persuasive strategies.
Daniel Maninger, Krishna Narasimhan, Mira Mezini
Abstract: it is expected that in the near future, ai software development assistants will play an important role in the software industry. however, current software development assistants tend to be unreliable, often producing incorrect, unsafe, or low-quality code. we seek to resolve these issues by introducing a holistic architecture for constructing, training, and using trustworthy ai software development assistants. in the center of the architecture, there is a foundational llm trained on datasets representative of real-world coding scenarios and complex software architectures, and fine-tuned on code quality criteria beyond correctness. the llm will make use of graph-based code representations for advanced semantic comprehension. we envision a knowledge graph integrated into the system to provide up-to-date background knowledge and to enable the assistant to provide appropriate explanations. finally, a modular framework for constrained decoding will ensure that certain guarantees (e.g., for correctness and security) hold for the generated code.
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, "Alex D'Amour", Dj Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant
Abstract: reward models play a key role in aligning language model applications towards human preferences. however, this setup creates an incentive for the language model to exploit errors in the reward model to achieve high estimated reward, a phenomenon often termed \emph{reward hacking}. a natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate. we explore the application of reward ensembles to alignment at both training time (through reinforcement learning) and inference time (through reranking). first, we show that reward models are \emph{underspecified}: reward models that perform similarly in-distribution can yield very different rewards when used in alignment, due to distribution shift. second, underspecification results in overoptimization, where alignment to one reward model does not improve reward as measured by another reward model trained on the same data. third, overoptimization is mitigated by the use of reward ensembles, and ensembles that vary by their \emph{pretraining} seeds lead to better generalization than ensembles that differ only by their \emph{fine-tuning} seeds, with both outperforming individual reward models. however, even pretrain reward ensembles do not eliminate reward hacking: we show several qualitative reward hacking phenomena that are not mitigated by ensembling because all reward models in the ensemble exhibit similar error patterns.
Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan
Abstract: safe deployment of large language models (llms) may benefit from a reliable method for assessing their generated content to determine when to abstain or to selectively generate. while likelihood-based metrics such as perplexity are widely employed, recent research has demonstrated the limitations of using sequence-level probability estimates given by llms as reliable indicators of generation quality. conversely, llms have demonstrated strong calibration at the token level, particularly when it comes to choosing correct answers in multiple-choice questions or evaluating true/false statements. in this work, we reformulate open-ended generation tasks into token-level prediction tasks, and leverage llms' superior calibration at the token level. we instruct an llm to self-evaluate its answers, employing either a multi-way comparison or a point-wise evaluation approach, with the option to include a ``none of the above'' option to express the model's uncertainty explicitly. we benchmark a range of scoring methods based on self-evaluation and evaluate their performance in selective generation using truthfulqa and tl;dr. through experiments with palm-2 and gpt-3, we demonstrate that self-evaluation based scores not only improve accuracy, but also correlate better with the overall quality of generated content.
Minyoung Hwang, Luca Weihs, Chanwoo Park, Kimin Lee, Aniruddha Kembhavi, Kiana Ehsani
Abstract: customizing robotic behaviors to be aligned with diverse human preferences is an underexplored challenge in the field of embodied ai. in this paper, we present promptable behaviors, a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences in complex environments. we use multi-objective reinforcement learning to train a single policy adaptable to a broad spectrum of preferences. we introduce three distinct methods to infer human preferences by leveraging different types of interactions: (1) human demonstrations, (2) preference feedback on trajectory comparisons, and (3) language instructions. we evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in procthor and robothor, demonstrating the ability to prompt agent behaviors to satisfy human preferences in various scenarios. project page: https://promptable-behaviors.github.io
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, Jeff Wu
Abstract: widely used alignment techniques, such as reinforcement learning from human feedback (rlhf), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. however, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. we study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? we test this using a range of pretrained language models in the gpt-4 family on natural language processing (nlp), chess, and reward modeling tasks. we find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. however, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like rlhf may scale poorly to superhuman models without further work. we find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning gpt-4 with a gpt-2-level supervisor and an auxiliary confidence loss, we can recover close to gpt-3.5-level performance on nlp tasks. our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
Ljubisa Bojic, Matteo Cinelli, Dubravko Culibrk, Boris Delibasic
Abstract: this paper explores the potential of a multidisciplinary approach to testing and aligning artificial general intelligence (agi) and llms. due to the rapid development and wide application of llms, challenges such as ethical alignment, controllability, and predictability of these models have become important research topics. this study investigates an innovative simulation-based multi-agent system within a virtual reality framework that replicates the real-world environment. the framework is populated by automated 'digital citizens,' simulating complex social structures and interactions to examine and optimize agi. application of various theories from the fields of sociology, social psychology, computer science, physics, biology, and economics demonstrates the possibility of a more human-aligned and socially responsible agi. the purpose of such a digital environment is to provide a dynamic platform where advanced ai agents can interact and make independent decisions, thereby mimicking realistic scenarios. the actors in this digital city, operated by the llms, serve as the primary agents, exhibiting high degrees of autonomy. while this approach shows immense potential, there are notable challenges and limitations, most significantly the unpredictable nature of real-world social dynamics. this research endeavors to contribute to the development and refinement of agi, emphasizing the integration of social, ethical, and theoretical dimensions for future research.

2023-12-13

Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie
Abstract: the evaluation of large language models (llms) is crucial to assess their performance and mitigate potential security risks. in this paper, we introduce promptbench, a unified library to evaluate llms. it consists of several key components that are easily used and extended by researchers: prompt construction, prompt engineering, dataset and model loading, adversarial prompt attack, dynamic evaluation protocols, and analysis tools. promptbench is designed to be an open, general, and flexible codebase for research purposes that can facilitate original study in creating new benchmarks, deploying downstream applications, and designing new evaluation protocols. the code is available at: https://github.com/microsoft/promptbench and will be continuously supported.
Oliver Guest, Michael Aird, Seán Ó Héigeartaigh
Abstract: ai alignment work is important from both a commercial and a safety lens. with this paper, we aim to help actors who support alignment efforts to make these efforts as effective as possible, and to avoid potential adverse effects. we begin by suggesting that institutions that are trying to act in the public interest (such as governments) should aim to support specifically alignment work that reduces accident or misuse risks. we then describe four problems which might cause alignment efforts to be counterproductive, increasing large-scale ai risks. we suggest mitigations for each problem. finally, we make a broader recommendation that institutions trying to act in the public interest should think systematically about how to make their alignment efforts as effective, and as likely to be beneficial, as possible.
Jiang Zhang, Qiong Wu, Yiming Xu, Cheng Cao, Zheng Du, Konstantinos Psounis
Abstract: toxic content detection is crucial for online services to remove inappropriate content that violates community standards. to automate the detection process, prior works have proposed varieties of machine learning (ml) approaches to train language models (lms) for toxic content detection. however, both their accuracy and transferability across datasets are limited. recently, large language models (llms) have shown promise in toxic content detection due to their superior zero-shot and few-shot in-context learning ability as well as broad transferability on ml tasks. however, efficiently designing prompts for llms remains challenging. moreover, the high run-time cost of llms may hinder their deployments in production. to address these challenges, in this work, we propose bd-llm, a novel and efficient approach to bootstrapping and distilling llms for toxic content detection. specifically, we design a novel prompting method named decision-tree-of-thought (dtot) to bootstrap llms' detection performance and extract high-quality rationales. dtot can automatically select more fine-grained context to re-prompt llms when their responses lack confidence. additionally, we use the rationales extracted via dtot to fine-tune student lms. our experimental results on various datasets demonstrate that dtot can improve the accuracy of llms by up to 4.6%. furthermore, student lms fine-tuned with rationales extracted via dtot outperform baselines on all datasets with up to 16.9\% accuracy improvement, while being more than 60x smaller than conventional llms. finally, we observe that student lms fine-tuned with rationales exhibit better cross-dataset transferability.
Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell
Abstract: in practice, preference learning from human feedback depends on incomplete data with hidden context. hidden context refers to data that affects the feedback received, but which is not represented in the data used to train a preference model. this captures common issues of data collection, such as having human annotators with varied preferences, cognitive processes that result in seemingly irrational behavior, and combining data labeled according to different criteria. we prove that standard applications of preference learning, including reinforcement learning from human feedback (rlhf), implicitly aggregate over hidden contexts according to a well-known voting rule called borda count. we show this can produce counter-intuitive results that are very different from other methods which implicitly aggregate via expected utility. furthermore, our analysis formalizes the way that preference learning from users with diverse values tacitly implements a social choice function. a key implication of this result is that annotators have an incentive to misreport their preferences in order to influence the learned model, leading to vulnerabilities in the deployment of rlhf. as a step towards mitigating these problems, we introduce a class of methods called distributional preference learning (dpl). dpl methods estimate a distribution of possible score values for each alternative in order to better account for hidden context. experimental results indicate that applying dpl to rlhf for llm chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. our code and data are available at https://github.com/cassidylaidlaw/hidden-context
Haiyang Tang, Zhenyi Liu, Dongping Chen, Qingzhao Chu
Abstract: recent advancements in large language models (llms) have notably propelled natural language processing (nlp) capabilities, demonstrating significant potential in safety engineering applications. despite these advancements, llms face constraints in processing specialized tasks, attributed to factors such as corpus size, input processing limitations, and privacy concerns. obtaining useful information from reliable sources in a limited time is crucial for llm. addressing this, our study introduces an llm-based q&a system for safety engineering, enhancing the comprehension and response accuracy of the model. we employed prompt engineering to incorporate external knowledge databases, thus enriching the llm with up-to-date and reliable information. the system analyzes historical incident reports through statistical methods, utilizes vector embedding to construct a vector database, and offers an efficient similarity-based search functionality. our findings indicate that the integration of external knowledge significantly augments the capabilities of llm for in-depth problem analysis and autonomous task assignment. it effectively summarizes accident reports and provides pertinent recommendations. this integration approach not only expands llm applications in safety engineering but also sets a precedent for future developments towards automation and intelligent systems.
Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie
Abstract: warning: this paper includes model outputs showing offensive content. recent large-scale visual-language generative models (vlgms) have achieved unprecedented improvement in multimodal image/text generation. however, these models might also generate toxic content, e.g., offensive text and pornography images, raising significant ethical risks. despite exhaustive studies on toxic degeneration of language models, this problem remains largely unexplored within the context of visual-language generation. this work delves into the propensity for toxicity generation and susceptibility to toxic data across various vlgms. for this purpose, we built tovilag, a dataset comprising 32k co-toxic/mono-toxic text-image pairs and 1k innocuous but evocative text that tends to stimulate toxicity. furthermore, we propose wintore, a novel toxicity metric tailored to visual-language generation, which theoretically reflects different aspects of toxicity considering both input and output. on such a basis, we benchmarked the toxicity of a diverse spectrum of vlgms and discovered that some models do more evil than expected while some are more vulnerable to infection, underscoring the necessity of vlgms detoxification. therefore, we develop an innovative bottleneck-based detoxification method. our method could reduce toxicity while maintaining comparable generation quality, providing a promising initial solution to this line of research.
Isabelle Hupont, Marina Wainer, Sam Nester, Sylvie Tissot, Lucía Iglesias-Blanco, Sandra Baldassarri
Abstract: recent publications explore ai biases in detecting objects and people in the environment. however, there is no research tackling how ai examines nature. this case study presents a pioneering exploration into the ai attitudes (ecocentric, anthropocentric and antipathetic) toward nature. experiments with a large language model (llm) and an image captioning algorithm demonstrate the presence of anthropocentric biases in ai. moreover, to delve deeper into these biases and human-nature-ai interaction, we conducted a real-life experiment in which participants underwent an immersive de-anthropocentric experience in a forest and subsequently engaged with chatgpt to co-create narratives. by creating fictional ai chatbot characters with ecocentric attributes, emotions and views, we successfully amplified ecocentric exchanges. we encountered some difficulties, mainly that participants deviated from narrative co-creation to short dialogues and questions and answers, possibly due to the novelty of interacting with llms. to solve this problem, we recommend providing preliminary guidelines on interacting with llms and allowing participants to get familiar with the technology. we plan to repeat this experiment in various countries and forests to expand our corpus of ecocentric materials.

2023-12-12

Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu
Abstract: recent research has made significant strides in applying alignment techniques to enhance the helpfulness and harmlessness of large language models (llms) in accordance with human intentions. in this paper, we argue for the importance of alignment for honesty, ensuring that llms proactively refuse to answer questions when they lack knowledge, while still not being overly conservative. however, a pivotal aspect of alignment for honesty involves discerning the limits of an llm's knowledge, which is far from straightforward. this challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. in this paper, we address these challenges by first establishing a precise problem definition and defining ``honesty'' inspired by the analects of confucius. this serves as a cornerstone for developing metrics that effectively measure an llm's honesty by quantifying its progress post-alignment. furthermore, we introduce a flexible training framework which is further instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. we open-source a wealth of resources to facilitate future research at https://github.com/gair-nlp/alignment-for-honesty, including honesty-aligned models, training and evaluation datasets for honesty alignment, concept glossary, as well as all relevant source code.
Xiang Li, Haoran Tang, Siyu Chen, Ziwei Wang, Anurag Maravi, Marcin Abram
Abstract: in this paper, we explore the challenges inherent to large language models (llms) like gpt-4, particularly their propensity for hallucinations, logic mistakes, and incorrect conclusions when tasked with answering complex questions. the capacity of llms to present erroneous answers in a coherent and semantically rigorous manner further complicates the detection of factual inaccuracies. this issue is especially pronounced in fields that require specialized expertise. our work delves into these challenges, aiming to enhance the understanding and mitigation of such errors, thereby contributing to the improvement of llm accuracy and reliability in scientific and other specialized domains. our findings reveal a non-linear relationship between the context's relevancy and the answers' measured quality. in addition, we demonstrate that with the correct calibration, it is possible to automate the grading procedure -- a finding suggesting that, at least to some degree, the llms can be used to self-examine the quality of their own performance. finally, we describe an experimental platform that can be seen as a proof-of-concept of the techniques described in this work.
Yang Trista Cao, Anna Sotnikova, Jieyu Zhao, Linda X. Zou, Rachel Rudinger, Hal Daume
Abstract: multilingual large language models have been increasingly popular for their proficiency in comprehending and generating text across various languages. previous research has shown that the presence of stereotypes and biases in monolingual large language models can be attributed to the nature of their training data, which is collected from humans and reflects societal biases. multilingual language models undergo the same training procedure as monolingual ones, albeit with training data sourced from various languages. this raises the question: do stereotypes present in one social context leak across languages within the model? in our work, we first define the term ``stereotype leakage'' and propose a framework for its measurement. with this framework, we investigate how stereotypical associations leak across four languages: english, russian, chinese, and hindi. to quantify the stereotype leakage, we employ an approach from social psychology, measuring stereotypes via group-trait associations. we evaluate human stereotypes and stereotypical associations manifested in multilingual large language models such as mbert, mt5, and chatgpt. our findings show a noticeable leakage of positive, negative, and non-polar associations across all languages. notably, hindi within multilingual models appears to be the most susceptible to influence from other languages, while chinese is the least. additionally, chatgpt exhibits a better alignment with human scores than other models.
Dun Zeng, Yong Dai, Pengyu Cheng, Tianhao Hu, Wanshun Chen, Nan Du, Zenglin Xu
Abstract: the alignment of large language models (llms) with human values is crucial for the development of artificial general intelligence (agi). one promising approach to achieve this alignment is reinforcement learning from human feedback, which employs a reward model (rm) learned from human preference datasets to guide llms in generating text that aligns with human preferences. through intensive experiments and analysis of reward distribution, this paper finds that preference datasets are diverse from each other, even though they are all proposed to align human preference. hence, mixing diverse human preference datasets to increase data size for enhancing reward modeling could fail. to address the issue and capture the shared human values from diverse preferences, a new training policy called more is introduced, which minimizes preference bias by adaptively adjusting the preference objective across diverse preferences. experiments with the pythia-1.4b model and five mixed preference datasets show that more achieves superior reward accuracy and lower calibration error, highlighting its ability to leverage diverse human preference data.
Swanand Ravindra Kadhe, Anisa Halimi, Ambrish Rawat, Nathalie Baracaldo
Abstract: training large language models (llms) is a costly endeavour in terms of time and computational resources. the large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. re-training from scratch is impractical and has led to the creation of the 'unlearning' discipline where models are modified to "unlearn" undesirable information without retraining. however, any modification can alter the behaviour of llms, especially on key dimensions such as fairness. this is the first work that examines this interplay between unlearning and fairness for llms. in particular, we focus on a popular unlearning framework known as sisa [bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. we evaluate the performance-fairness trade-off for sisa, and empirically demsontrate that sisa can indeed reduce fairness in llms. to remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by sisa. we adapt the post-processing fairness improvement technique from [hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for ensemble of models. through experimental results, we demonstrate the efficacy of our post-processing framework called 'fairsisa'.
Manish Nagireddy, Lamogha Chiazor, Moninder Singh, Ioana Baldini
Abstract: current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. in this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. we start with a comprehensive list of 93 stigmas documented in social science literature and curate a question-answering (qa) dataset which involves simple social situations. our benchmark, socialstigmaqa, contains roughly 10k prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. we present results for socialstigmaqa with two widely used open source generative language models and we demonstrate that the output generated by these models considerably amplifies existing social bias against stigmatized groups. specifically, we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. we discover that the deliberate design of the templates in our benchmark (e.g., by adding biasing text to the prompt or varying the answer that indicates bias) impact the model tendencies to generate socially biased output. additionally, we report on patterns in the generated chain-of-thought output, finding a variety of problems from subtle bias to evidence of a lack of reasoning. warning: this paper contains examples of text which is toxic, biased, and harmful.
Wei Zhao, Zhe Li, Jun Sun
Abstract: large language models (llms) such as gpt and llama2 are increasingly adopted in many safety-critical applications. their security is thus essential. even with considerable efforts spent on reinforcement learning from human feedback (rlhf), recent studies have shown that llms are still subject to attacks such as adversarial perturbation and trojan attacks. further research is thus needed to evaluate their security and/or understand the lack of it. in this work, we propose a framework for conducting light-weight causality-analysis of llms at the token, layer, and neuron level. we applied our framework to open-source llms such as llama2 and vicuna and had multiple interesting discoveries. based on a layer-level causality analysis, we show that rlhf has the effect of overfitting a model to harmful prompts. it implies that such security can be easily overcome by `unusual' harmful prompts. as evidence, we propose an adversarial perturbation method that achieves 100\% attack success rate on the red-teaming tasks of the trojan detection competition 2023. furthermore, we show the existence of one mysterious neuron in both llama2 and vicuna that has an unreasonably high causal effect on the output. while we are uncertain on why such a neuron exists, we show that it is possible to conduct a ``trojan'' attack targeting that particular neuron to completely cripple the llm, i.e., we can generate transferable suffixes to prompts that frequently make the llm produce meaningless responses.

2023-12-11

Heegyu Kim, Hyunsouk Cho
Abstract: caution: this paper includes offensive words that could potentially cause unpleasantness. the fast-paced evolution of generative language models such as gpt-4 has demonstrated outstanding results in various nlp generation tasks. however, due to the potential generation of offensive words related to race or gender, various controllable text generation (ctg) methods have been proposed to mitigate the occurrence of harmful words. however, existing ctg methods not only reduce toxicity but also negatively impact several aspects of the language model's generation performance, including topic consistency, grammar, and perplexity. this paper explores the limitations of previous methods and introduces a novel solution in the form of a simple gated toxicity avoidance (gta) that can be applied to any ctg method. we also evaluate the effectiveness of the proposed gta by comparing it with state-of-the-art ctg methods across various datasets. our findings reveal that gated toxicity avoidance efficiently achieves comparable levels of toxicity reduction to the original ctg methods while preserving the generation performance of the language model.
Lifu Tu, Semih Yavuz, Jin Qu, Jiacheng Xu, Rui Meng, Caiming Xiong, Yingbo Zhou
Abstract: large language models (llms) have demonstrated a powerful ability for text generation. however, achieving optimal results with a given prompt or instruction can be challenging, especially for billion-sized models. additionally, undesired behaviors such as toxicity or hallucinations can manifest. while much larger models (e.g., chatgpt) may demonstrate strength in mitigating these issues, there is still no guarantee of complete prevention. in this work, we propose formalizing text generation as a future-constrained generation problem to minimize undesirable behaviors and enforce faithfulness to instructions. the estimation of future constraint satisfaction, accomplished using llms, guides the text generation process. our extensive experiments demonstrate the effectiveness of the proposed approach across three distinct text generation tasks: keyword-constrained generation (lin et al., 2020), toxicity reduction (gehman et al., 2020), and factual correctness in question-answering (gao et al., 2023).
Sanghak Oh, Kiho Lee, Seonhye Park, Doowon Kim, Hyoungshick Kim
Abstract: ai-powered coding assistant tools have revolutionized the software engineering ecosystem. however, prior work has demonstrated that these tools are vulnerable to poisoning attacks. in a poisoning attack, an attacker intentionally injects maliciously crafted insecure code snippets into training datasets to manipulate these tools. the poisoned tools can suggest insecure code to developers, resulting in vulnerabilities in their products that attackers can exploit. however, it is still little understood whether such poisoning attacks against the tools would be practical in real-world settings and how developers address the poisoning attacks during software development. to understand the real-world impact of poisoning attacks on developers who rely on ai-powered coding assistants, we conducted two user studies: an online survey and an in-lab study. the online survey involved 238 participants, including software developers and computer science students. the survey results revealed widespread adoption of these tools among participants, primarily to enhance coding speed, eliminate repetition, and gain boilerplate code. however, the survey also found that developers may misplace trust in these tools because they overlooked the risk of poisoning attacks. the in-lab study was conducted with 30 professional developers. the developers were asked to complete three programming tasks with a representative type of ai-powered coding assistant tool, running on visual studio code. the in-lab study results showed that developers using a poisoned chatgpt-like tool were more prone to including insecure code than those using an intellicode-like tool or no tool. this demonstrates the strong influence of these tools on the security of generated code. our study results highlight the need for education and improved coding practices to address new security issues introduced by ai-powered coding assistant tools.
Jiaxu Zhao, Meng Fang, Shirui Pan, Wenpeng Yin, Mykola Pechenizkiy
Abstract: warning: this paper contains content that may be offensive or upsetting. there has been a significant increase in the usage of large language models (llms) in various applications, both in their original form and through fine-tuned adaptations. as a result, llms have gained popularity and are being widely adopted by a large user community. however, one of the concerns with llms is the potential generation of socially biased content. the existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability. in this work, we propose a bias evaluation framework named gptbias that leverages the high performance of llms (e.g., gpt-4 \cite{openai2023gpt4}) to assess bias in models. we also introduce prompts called bias attack instructions, which are specifically designed for evaluating model bias. to enhance the credibility and interpretability of bias evaluation, our framework not only provides a bias score but also offers detailed information, including bias types, affected demographics, keywords, reasons behind the biases, and suggestions for improvement. we conduct extensive experiments to demonstrate the effectiveness and usability of our bias evaluation framework.
Jiyan He, Weitao Feng, Yaosen Min, Jingwei Yi, Kunsheng Tang, Shuai Li, Jie Zhang, Kejiang Chen, Wenbo Zhou, Xing Xie, Weiming Zhang, Nenghai Yu, Shuxin Zheng
Abstract: the expanding application of artificial intelligence (ai) in scientific fields presents unprecedented opportunities for discovery and innovation. however, this growth is not without risks. ai models in science, if misused, can amplify risks like creation of harmful substances, or circumvention of established regulations. in this study, we aim to raise awareness of the dangers of ai misuse in science, and call for responsible ai development and use in this domain. we first itemize the risks posed by ai in scientific contexts, then demonstrate the risks by highlighting real-world examples of misuse in chemical science. these instances underscore the need for effective risk management strategies. in response, we propose a system called sciguard to control misuse risks for ai models in science. we also propose a red-teaming benchmark scimt-safety to assess the safety of different systems. our proposed sciguard shows the least harmful impact in the assessment without compromising performance in benign tests. finally, we highlight the need for a multidisciplinary and collaborative effort to ensure the safe and ethical use of ai models in science. we hope that our study can spark productive discussions on using ai ethically in science among researchers, practitioners, policymakers, and the public, to maximize benefits and minimize the risks of misuse.
Shabaz Patel, Hassan Kane, Rayhan Patel
Abstract: large language models (llms) have demonstrated remarkable performance across numerous natural language understanding use cases. however, this impressive performance comes with inherent limitations, such as the tendency to perpetuate stereotypical biases or fabricate non-existent facts. in the context of islam and its representation, accurate and factual representation of its beliefs and teachings rooted in the quran and sunnah is key. this work focuses on the challenge of building domain-specific llms faithful to the islamic worldview and proposes ways to build and evaluate such systems. firstly, we define this open-ended goal as a technical problem and propose various solutions. subsequently, we critically examine known challenges inherent to each approach and highlight evaluation methodologies that can be used to assess such systems. this work highlights the need for high-quality datasets, evaluations, and interdisciplinary work blending machine learning with islamic scholarship.
Aida Davani, Mark Díaz, Dylan Baker, Vinodkumar Prabhakaran
Abstract: perception of offensiveness is inherently subjective, shaped by the lived experiences and socio-cultural values of the perceivers. recent years have seen substantial efforts to build ai-based tools that can detect offensive language at scale, as a means to moderate social media platforms, and to ensure safety of conversational ai technologies such as chatgpt and bard. however, existing approaches treat this task as a technical endeavor, built on top of data annotated for offensiveness by a global crowd workforce without any attention to the crowd workers' provenance or the values their perceptions reflect. we argue that cultural and psychological factors play a vital role in the cognitive processing of offensiveness, which is critical to consider in this context. we re-frame the task of determining offensiveness as essentially a matter of moral judgment -- deciding the boundaries of ethically wrong vs. right language within an implied set of socio-cultural norms. through a large-scale cross-cultural study based on 4309 participants from 21 countries across 8 cultural regions, we demonstrate substantial cross-cultural differences in perceptions of offensiveness. more importantly, we find that individual moral values play a crucial role in shaping these variations: moral concerns about care and purity are significant mediating factors driving cross-cultural differences. these insights are of crucial importance as we build ai models for the pluralistic world, where the values they espouse should aim to respect and account for moral values in diverse geo-cultural contexts.
Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong
Abstract: recent developments in balancing the usefulness and safety of large language models (llms) have raised a critical question: are mainstream nlp tasks adequately aligned with safety consideration? our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various nlp tasks. for instance, llms can effectively summarize malicious long documents but often refuse to translate them. this discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integraty of tasks traditionally deemed more robust, such as translation and question-answering (qa). moreover, the concurrent use of multiple nlp tasks with lesser safety alignment increases the risk of llms inadvertently processing harmful content. we demonstrate these vulnerabilities in various safety-aligned llms, particularly llama2 models and gpt-4, indicating an urgent need for strengthening safety alignments across a broad spectrum of nlp tasks.
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger
Abstract: as large language models (llms) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. however, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. in this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. we investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, gpt-4), access to a less powerful trusted model (in our case, gpt-3.5), and limited access to human contractors who provide high-quality trusted labor. we investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. we investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. one protocol is what we call trusted editing. this protocol first asks gpt-4 to write code, and then asks gpt-3.5 to rate the suspiciousness of that code. if the code is below some suspiciousness threshold, it is submitted. otherwise, gpt-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. another protocol is untrusted monitoring. this protocol asks gpt-4 to write code, and then asks another instance of gpt-4 whether the code is backdoored, using various techniques to prevent the gpt-4 instances from colluding. these protocols improve substantially on simple baselines.

2023-12-10

Sangwon Hyun, Mingyu Guo, M. Ali Babar
Abstract: large-language models (llms) have shifted the paradigm of natural language data processing. however, their black-boxed and probabilistic characteristics can lead to potential risks in the quality of outputs in diverse llm applications. recent studies have tested quality attributes (qas), such as robustness or fairness, of llms by generating adversarial input texts. however, existing studies have limited their coverage of qas and tasks in llms and are difficult to extend. additionally, these studies have only used one evaluation metric, attack success rate (asr), to assess the effectiveness of their approaches. we propose a metamorphic testing for analyzing llms (metal) framework to address these issues by applying metamorphic testing (mt) techniques. this approach facilitates the systematic testing of llm qualities by defining metamorphic relations (mrs), which serve as modularized evaluation metrics. the metal framework can automatically generate hundreds of mrs from templates that cover various qas and tasks. in addition, we introduced novel metrics that integrate the asr method into the semantic qualities of text to assess the effectiveness of mrs accurately. through the experiments conducted with three prominent llms, we have confirmed that the metal framework effectively evaluates essential qas on primary llm tasks and reveals the quality risks in llms. moreover, the newly proposed metrics can guide the optimal mrs for testing each task and suggest the most effective method for generating mrs.
Seth Neel, Peter Chang
Abstract: this is the first survey of the active area of ai research that focuses on privacy issues in large language models (llms). specifically, we focus on work that red-teams models to highlight privacy risks, attempts to build privacy into the training or inference process, enables efficient data deletion from trained models to comply with existing privacy regulations, and tries to mitigate copyright issues. our focus is on summarizing technical research that develops algorithms, proves theorems, and runs empirical evaluations. while there is an extensive body of legal and policy work addressing these challenges from a different angle, that is not the focus of our survey. nevertheless, these works, along with recent legal developments do inform how these technical problems are formalized, and so we discuss them briefly in section 1. while we have made our best effort to include all the relevant work, due to the fast moving nature of this research we may have missed some recent work. if we have missed some of your work please contact us, as we will attempt to keep this survey relatively up to date. we are maintaining a repository with the list of papers covered in this survey and any relevant code that was publicly available at https://github.com/safr-ml-lab/survey-llm.
Devin Gonier, Adrian Adduci, Cassidy Locascio
Abstract: ai alignment research seeks to align human and ai goals to ensure independent actions by a machine are always ethical. this paper argues empathy is necessary for this task, despite being often neglected in favor of more deductive approaches. we offer an inside-out approach that grounds morality within the context of the brain as a basis for algorithmically understanding ethics and empathy. these arguments are justified via a survey of relevant literature. the paper concludes with a suggested experimental approach to future research and some initial experimental observations.

2023-12-09

Zhou Ziheng, Yingnian Wu, Song-Chun Zhu, Demetri Terzopoulos
Abstract: we introduce aligner, a novel parameter-efficient fine-tuning (peft) method for aligning multi-billion-parameter-sized large language models (llms). aligner employs a unique design that constructs a globally shared set of tunable tokens that modify the attention of every layer. remarkably with this method, even when using one token accounting for a mere 5,000 parameters, aligner can still perform comparably well to state-of-the-art llm adaptation methods like lora that require millions of parameters. this capacity is substantiated in both instruction following and value alignment tasks. besides the multiple order-of-magnitude improvement in parameter efficiency, the insight aligner provides into the internal mechanisms of llms is also valuable. the architectural features and efficacy of our method, in addition to our experiments demonstrate that an llm separates its internal handling of "form" and "knowledge" in a somewhat orthogonal manner. this finding promises to motivate new research into llm mechanism understanding and value alignment.
Gustavo Gonçalves, Emma Strubell
Abstract: large language models (llms) trained with self-supervision on vast corpora of web text fit to the social biases of that text. without intervention, these social biases persist in the model's predictions in downstream tasks, leading to representational harm. many strategies have been proposed to mitigate the effects of inappropriate social biases learned during pretraining. simultaneously, methods for model compression have become increasingly popular to reduce the computational burden of llms. despite the popularity and need for both approaches, little work has been done to explore the interplay between these two. we perform a carefully controlled study of the impact of model compression via quantization and knowledge distillation on measures of social bias in llms. longer pretraining and larger models led to higher social bias, and quantization showed a regularizer effect with its best trade-off around 20% of the original pretraining time.
Mithila Sivakumar, Alvine Boaye Belle, Jinjun Shan, Kimya Khakzad Shahandashti
Abstract: in the ever-evolving landscape of software engineering, the emergence of large language models (llms) and conversational interfaces, exemplified by chatgpt, is nothing short of revolutionary. while their potential is undeniable across various domains, this paper sets out on a captivating expedition to investigate their uncharted territory, the exploration of generating safety cases. in this paper, our primary objective is to delve into the existing knowledge base of gpt-4, focusing specifically on its understanding of the goal structuring notation (gsn), a well-established notation allowing to visually represent safety cases. subsequently, we perform four distinct experiments with gpt-4. these experiments are designed to assess its capacity for generating safety cases within a defined system and application domain. to measure the performance of gpt-4 in this context, we compare the results it generates with ground-truth safety cases created for an x-ray system system and a machine-learning (ml)-enabled component for tire noise recognition (tnr) in a vehicle. this allowed us to gain valuable insights into the model's generative capabilities. our findings indicate that gpt-4 demonstrates the capacity to produce safety arguments that are moderately accurate and reasonable. furthermore, it exhibits the capability to generate safety cases that closely align with the semantic content of the reference safety cases used as ground-truths in our experiments.

2023-12-08

Boyi Zeng, Chenghu Zhou, Xinbing Wang, Zhouhan Lin
Abstract: protecting the copyright of large language models (llms) has become crucial due to their resource-intensive training and accompanying carefully designed licenses. however, identifying the original base model of an llm is challenging due to potential parameter alterations through fine-tuning or continued pretraining. in this study, we introduce huref, a human-readable fingerprint for llms that uniquely identifies the base model without exposing model parameters or interfering with training. we first observe that the vector direction of llm parameters remains stable after the model has converged during pretraining, showing negligible perturbations through subsequent training steps, including continued pretraining, supervised fine-tuning (sft), and rlhf, which makes it a sufficient condition to identify the base model. the necessity is validated by continuing to train an llm with an extra term to drive away the model parameters' direction and the model becomes damaged. however, this direction is vulnerable to simple attacks like dimension permutation or matrix rotation, which significantly change it without affecting performance. to address this, leveraging the transformer structure, we systematically analyze potential attacks and define three invariant terms that identify an llm's base model. we make these invariant terms human-readable by mapping them to a gaussian vector using a convolutional encoder and then converting it into a natural image with stylegan2. our method generates a dog image as an identity fingerprint for an llm, where the dog's appearance strongly indicates the llm's base model. experimental results across various llms demonstrate the effectiveness of our method, the generated dog image remains invariant to different training steps, including sft, rlhf, or even continued pretraining with augmented vocabulary in a new language.
Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-Jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson
Abstract: large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. in this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. first, we contribute an improved version of the massively multilingual and multimodal seamlessm4t model-seamlessm4t v2. this newer model, incorporating an updated unity2 framework, was trained on more low-resource language data. seamlessm4t v2 provides the foundation on which our next two models are initiated. seamlessexpressive enables translation that preserves vocal styles and prosody. compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. as for seamlessstreaming, our model leverages the efficient monotonic multihead attention mechanism to generate low-latency target translations without waiting for complete source utterances. as the first of its kind, seamlessstreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. to ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. consequently, we bring major components from seamlessexpressive and seamlessstreaming together to form seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. the contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication
Mobashir Sadat, Zhengyu Zhou, Lukas Lange, Jun Araki, Arsalan Gundroo, Bingqing Wang, Rakesh R Menon, Md Rizwan Parvez, Zhe Feng
Abstract: hallucination is a well-known phenomenon in text generated by large language models (llms). the existence of hallucinatory responses is found in almost all application scenarios e.g., summarization, question-answering (qa) etc. for applications requiring high reliability (e.g., customer-facing assistants), the potential existence of hallucination in llm-generated text is a critical problem. the amount of hallucination can be reduced by leveraging information retrieval to provide relevant background information to the llm. however, llms can still generate hallucinatory content for various reasons (e.g., prioritizing its parametric knowledge over the context, failure to capture the relevant information from the context, etc.). detecting hallucinations through automated methods is thus paramount. to facilitate research in this direction, we introduce a sophisticated dataset, delucionqa, that captures hallucinations made by retrieval-augmented llms for a domain-specific qa task. furthermore, we propose a set of hallucination detection methods to serve as baselines for future works from the research community. analysis and case study are also provided to share valuable insights on hallucination phenomena in the target scenario.
Hongzhan Lin, Ziyang Luo, Jing Ma, Long Chen
Abstract: the age of social media is rife with memes. understanding and detecting harmful memes pose a significant challenge due to their implicit meaning that is not explicitly conveyed through the surface text and image. however, existing harmful meme detection approaches only recognize superficial harm-indicative signals in an end-to-end classification manner but ignore in-depth cognition of the meme text and image. in this paper, we attempt to detect harmful memes based on advanced reasoning over the interplay of multimodal information in memes. inspired by the success of large language models (llms) on complex reasoning, we first conduct abductive reasoning with llms. then we propose a novel generative framework to learn reasonable thoughts from llms for better multimodal fusion and lightweight fine-tuning, which consists of two training stages: 1) distill multimodal reasoning knowledge from llms; and 2) fine-tune the generative framework to infer harmfulness. extensive experiments conducted on three meme datasets demonstrate that our proposed approach achieves superior performance than state-of-the-art methods on the harmful meme detection task.

2023-12-07

Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, Bing Qin
Abstract: extensive work has been devoted to improving the safety mechanism of large language models (llms). however, in specific scenarios, llms still generate harmful responses when faced with malicious instructions, a phenomenon referred to as "jailbreak attack". in our research, we introduce a novel jailbreak attack method (\textbf{radial}), which consists of two steps: 1) inherent response tendency analysis: we analyze the inherent affirmation and rejection tendency of llms to react to real-world instructions. 2) real-world instructions-driven jailbreak: based on our analysis, we strategically choose several real-world instructions and embed malicious instructions into them to amplify the llm's potential to generate harmful responses. on three open-source human-aligned llms, our method achieves excellent jailbreak attack performance for both chinese and english malicious instructions. besides, we guided detailed ablation experiments and verified the effectiveness of our core idea "inherent response tendency analysis". our exploration also exposes the vulnerability of llms to being induced into generating more detailed harmful responses in subsequent rounds of dialogue.
Vasisht Duddu, Sebastian Szyller, N. Asokan
Abstract: machine learning (ml) models cannot neglect risks to security, privacy, and fairness. several defenses have been proposed to mitigate such risks. when a defense is effective in mitigating one risk, it may correspond to increased or decreased susceptibility to other risks. existing research lacks an effective framework to recognize and explain these unintended interactions. we present such a framework, based on the conjecture that overfitting and memorization underlie unintended interactions. we survey existing literature on unintended interactions, accommodating them within our framework. we use our framework to conjecture on two previously unexplored interactions, and empirically validate our conjectures.
Cecil Abungu, Michelle Malonza, Sumaya Nur Adan
Abstract: increasingly, there is well-grounded concern that through perpetual scaling-up of computation power and data, current deep learning techniques will create highly capable artificial intelligence that could pursue goals in a manner that is not aligned with human values. in turn, such ai could have the potential of leading to a scenario in which there is serious global-scale damage to human wellbeing. against this backdrop, a number of researchers and public policy professionals have been developing ideas about how to govern ai in a manner that reduces the chances that it could lead to a global catastrophe. the jurisdictional focus of a vast majority of their assessments so far has been the united states, china, and europe. that preference seems to reveal an assumption underlying most of the work in this field: that global south countries can only have a marginal role in attempts to govern ai development from a global catastrophic risk -focused perspective. our paper sets out to undermine this assumption. we argue that global south countries like india and singapore (and specific coalitions) could in fact be fairly consequential in the global catastrophic risk-focused governance of ai. we support our position using 4 key claims. 3 are constructed out of the current ways in which advanced foundational ai models are built and used while one is constructed on the strategic roles that global south countries and coalitions have historically played in the design and use of multilateral rules and institutions. as each claim is elaborated, we also suggest some ways through which global south countries can play a positive role in designing, strengthening and operationalizing global catastrophic risk-focused ai governance.
Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David Leblanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, Joshua Saxe
Abstract: this paper presents cyberseceval, a comprehensive benchmark developed to help bolster the cybersecurity of large language models (llms) employed as coding assistants. as what we believe to be the most extensive unified cybersecurity safety benchmark to date, cyberseceval provides a thorough evaluation of llms in two crucial security domains: their propensity to generate insecure code and their level of compliance when asked to assist in cyberattacks. through a case study involving seven models from the llama 2, code llama, and openai gpt large language model families, cyberseceval effectively pinpointed key cybersecurity risks. more importantly, it offered practical insights for refining these models. a significant observation from the study was the tendency of more advanced models to suggest insecure code, highlighting the critical need for integrating security considerations in the development of sophisticated llms. cyberseceval, with its automated test case generation and evaluation pipeline covers a broad scope and equips llm designers and researchers with a tool to broadly measure and enhance the cybersecurity safety properties of llms, contributing to the development of more secure ai systems.
Fangzhou Wu, Xiaogeng Liu, Chaowei Xiao
Abstract: with the advancement of large language models (llms), significant progress has been made in code generation, enabling llms to transform natural language into programming code. these code llms have been widely accepted by massive users and organizations. however, a dangerous nature is hidden in the code, which is the existence of fatal vulnerabilities. while some llm providers have attempted to address these issues by aligning with human guidance, these efforts fall short of making code llms practical and robust. without a deep understanding of the performance of the llms under the practical worst cases, it would be concerning to apply them to various real-world applications. in this paper, we answer the critical issue: are existing code llms immune to generating vulnerable code? if not, what is the possible maximum severity of this issue in practical deployment scenarios? in this paper, we introduce deceptprompt, a novel algorithm that can generate adversarial natural language instructions that drive the code llms to generate functionality correct code with vulnerabilities. deceptprompt is achieved through a systematic evolution-based algorithm with a fine grain loss design. the unique advantage of deceptprompt enables us to find natural prefix/suffix with totally benign and non-directional semantic meaning, meanwhile, having great power in inducing the code llms to generate vulnerable code. this feature can enable us to conduct the almost-worstcase red-teaming on these llms in a real scenario, where users are using natural language. our extensive experiments and analyses on deceptprompt not only validate the effectiveness of our approach but also shed light on the huge weakness of llms in the code generation task. when applying the optimized prefix/suffix, the attack success rate (asr) will improve by average 50% compared with no prefix/suffix applying.
Tzu-Heng Huang, Harit Vishwakarma, Frederic Sala
Abstract: organizations typically train large models individually. this is costly and time-consuming, particularly for large-scale foundation models. such vertical production is known to be suboptimal. inspired by this economic insight, we ask whether it is possible to leverage others' expertise by trading the constituent parts in models, i.e., sets of weights, as if they were market commodities. while recent advances in aligning and interpolating models suggest that doing so may be possible, a number of fundamental questions must be answered to create viable parameter markets. in this work, we address these basic questions, propose a framework containing the infrastructure necessary for market operations to take place, study strategies for exchanging parameters, and offer means for agents to monetize parameters. excitingly, compared to agents who train siloed models from scratch, we show that it is possible to mutually gain by using the market, even in competitive settings. this suggests that the notion of parameter markets may be a useful paradigm for improving large-scale model training in the future.
Shuli Jiang, Swanand Ravindra Kadhe, Yi Zhou, Ling Cai, Nathalie Baracaldo
Abstract: growing applications of large language models (llms) trained by a third party raise serious concerns on the security vulnerability of llms.it has been demonstrated that malicious actors can covertly exploit these vulnerabilities in llms through poisoning attacks aimed at generating undesirable outputs. while poisoning attacks have received significant attention in the image domain (e.g., object detection), and classification tasks, their implications for generative models, particularly in the realm of natural language generation (nlg) tasks, remain poorly understood. to bridge this gap, we perform a comprehensive exploration of various poisoning techniques to assess their effectiveness across a range of generative tasks. furthermore, we introduce a range of metrics designed to quantify the success and stealthiness of poisoning attacks specifically tailored to nlg tasks. through extensive experiments on multiple nlg tasks, llms and datasets, we show that it is possible to successfully poison an llm during the fine-tuning stage using as little as 1\% of the total tuning data samples. our paper presents the first systematic approach to comprehend poisoning attacks targeting nlg tasks considering a wide range of triggers and attack settings. we hope our findings will assist the ai security community in devising appropriate defenses against such threats.
Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, Xiangyu Zhang
Abstract: large language models (llms) are now widely used in various applications, making it crucial to align their ethical standards with human values. however, recent jail-breaking methods demonstrate that this alignment can be undermined using carefully constructed prompts. in our study, we reveal a new threat to llm alignment when a bad actor has access to the model's output logits, a common feature in both open-source llms and many commercial llm apis (e.g., certain gpt models). it does not rely on crafting specific prompts. instead, it exploits the fact that even when an llm rejects a toxic request, a harmful response often hides deep in the output logits. by forcefully selecting lower-ranked output tokens during the auto-regressive generation process at a few critical output positions, we can compel the model to reveal these hidden responses. we term this process model interrogation. this approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster. the harmful content uncovered through our method is more relevant, complete, and clear. additionally, it can complement jail-breaking strategies, with which results in further boosting attack performance. our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
Xinyi Chen, Angelica Chen, Dean Foster, Elad Hazan
Abstract: we consider the setting of ai safety by debate as a repeated game. we consider the question of efficient regret minimization in this setting, when the players are either ais or humans, equipped with access to computationally superior ais. in such a setting, we characterize when internal and external regret can be minimized efficiently. we conclude with conditions in which a sequence of strategies converges to a correlated equilibrium.
Fangzhou Wu, Qingzhao Zhang, Ati Priya Bajaj, Tiffany Bao, Ning Zhang, Ruoyu "Fish" Wang, Chaowei Xiao
Abstract: large language models (llms) have undergone rapid evolution and achieved remarkable results in recent times. openai's chatgpt, backed by gpt-3.5 or gpt-4, has gained instant popularity due to its strong capability across a wide range of tasks, including natural language tasks, coding, mathematics, and engaging conversations. however, the impacts and limits of such llms in system security domain are less explored. in this paper, we delve into the limits of llms (i.e., chatgpt) in seven software security applications including vulnerability detection/repair, debugging, debloating, decompilation, patching, root cause analysis, symbolic execution, and fuzzing. our exploration reveals that chatgpt not only excels at generating code, which is the conventional application of language models, but also demonstrates strong capability in understanding user-provided commands in natural languages, reasoning about control and data flows within programs, generating complex data structures, and even decompiling assembly code. notably, gpt-4 showcases significant improvements over gpt-3.5 in most security tasks. also, certain limitations of chatgpt in security-related tasks are identified, such as its constrained ability to process long code contexts.
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa
Abstract: we introduce llama guard, an llm-based input-output safeguard model geared towards human-ai conversation use cases. our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in llm prompts (i.e., prompt classification). this taxonomy is also instrumental in classifying the responses generated by llms to these prompts, a process we refer to as response classification. for the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. llama guard, a llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the openai moderation evaluation dataset and toxicchat, where its performance matches or exceeds that of currently available content moderation tools. llama guard functions as a language model, carrying out multi-class classification and generating binary decision scores. furthermore, the instruction fine-tuning of llama guard allows for the customization of tasks and the adaptation of output formats. this feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. we are making llama guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for ai safety.
Joonhyun Jeong
Abstract: recently, large multi-modal models (lmms) have demonstrated their ability to understand the visual contents of images given the instructions regarding the images. built upon the large language models (llms), lmms also inherit their abilities and characteristics such as in-context learning where a coherent sequence of images and texts are given as the input prompt. however, we identify a new limitation of off-the-shelf lmms where a small fraction of incoherent images or text descriptions mislead lmms to only generate biased output about the hijacked context, not the originally intended context. to address this, we propose a pre-filtering method that removes irrelevant contexts via gpt-4v, based on its robustness towards distribution shift within the contexts. we further investigate whether replacing the hijacked visual and textual contexts with the correlated ones via gpt-4v and text-to-image models can help yield coherent responses.

2023-12-06

Aaron J. Snoswell, Lucinda Nelson, Hao Xue, Flora D. Salim, Nicolas Suzor, Jean Burgess
Abstract: generic `toxicity' classifiers continue to be used for evaluating the potential for harm in natural language generation, despite mounting evidence of their shortcomings. we consider the challenge of measuring misogyny in natural language generation, and argue that generic `toxicity' classifiers are inadequate for this task. we use data from two well-characterised `incel' communities on reddit that differ primarily in their degrees of misogyny to construct a pair of training corpora which we use to fine-tune two language models. we show that an open source `toxicity' classifier is unable to distinguish meaningfully between generations from these models. we contrast this with a misogyny-specific lexicon recently proposed by feminist subject-matter experts, demonstrating that, despite the limitations of simple lexicon-based approaches, this shows promise as a benchmark to evaluate language models for misogyny, and that it is sensitive enough to reveal the known differences in these reddit communities. our preliminary findings highlight the limitations of a generic approach to evaluating harms, and further emphasise the need for careful benchmark design and selection in natural language evaluation.
Eojin Jeon, Mingyu Lee, Juhyeong Park, Yeachan Kim, Wing-Lam Mok, Sangkeun Lee
Abstract: biases in the dataset often enable the model to achieve high performance on in-distribution data, while poorly performing on out-of-distribution data. to mitigate the detrimental effect of the bias on the networks, previous works have proposed debiasing methods that down-weight the biased examples identified by an auxiliary model, which is trained with explicit bias labels. however, finding a type of bias in datasets is a costly process. therefore, recent studies have attempted to make the auxiliary model biased without the guidance (or annotation) of bias labels, by constraining the model's training environment or the capability of the model itself. despite the promising debiasing results of recent works, the multi-class learning objective, which has been naively used to train the auxiliary model, may harm the bias mitigation effect due to its regularization effect and competitive nature across classes. as an alternative, we propose a new debiasing framework that introduces binary classifiers between the auxiliary model and the main model, coined bias experts. specifically, each bias expert is trained on a binary classification task derived from the multi-class classification task via the one-vs-rest approach. experimental results demonstrate that our proposed strategy improves the bias identification ability of the auxiliary model. consequently, our debiased model consistently outperforms the state-of-the-art on various challenge datasets.
Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli
Abstract: as language models (lms) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. however, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. we present a method for proactively evaluating the potential discriminatory impact of lms in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. specifically, we use an lm to generate a wide array of potential prompts that decision-makers may input into an lm, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. applying this methodology reveals patterns of both positive and negative discrimination in the claude 2.0 model in select settings when no interventions are applied. while we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. we release our dataset and prompts at https://huggingface.co/datasets/anthropic/discrim-eval
Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
Abstract: recent work in activation steering has demonstrated the potential to better control the outputs of large language models (llms), but it involves finding steering vectors. this is difficult because engineers do not typically know how features are represented in these models. we seek to address this issue by applying the idea of mean-centring to steering vectors. we find that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. we test this method on a variety of models on natural language tasks by steering away from generating toxic text, and steering the completion of a story towards a target genre. we also apply mean-centring to extract function vectors, more effectively triggering the execution of a range of natural language tasks by a significant margin (compared to previous baselines). this suggests that mean-centring can be used to easily improve the effectiveness of activation steering in a wide range of contexts.
Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek
Abstract: this year, we witnessed a rise in the use of large language models, especially when combined with applications like chatbot assistants. safety mechanisms and specialized training procedures are put in place to prevent improper responses from these assistants. in this work, we bypass these measures for chatgpt and bard (and, to some extent, bing chat) by making them impersonate complex personas with opposite characteristics as those of the truthful assistants they are supposed to be. we start by creating elaborate biographies of these personas, which we then use in a new session with the same chatbots. our conversation followed a role-play style to get the response the assistant was not allowed to provide. by making use of personas, we show that the response that is prohibited is actually provided, making it possible to obtain unauthorized, illegal, or harmful information. this work shows that by using adversarial personas, one can overcome safety mechanisms set out by chatgpt and bard. it also introduces several ways of activating such adversarial personas, altogether showing that both chatbots are vulnerable to this kind of attack.
Andrew Konya, Deger Turan, Aviv Ovadya, Lina Qui, Daanish Masood, Flynn Devine, Lisa Schirch, Isabella Roberts, Deliberative Alignment Forum
Abstract: for humanity to maintain and expand its agency into the future, the most powerful systems we create must be those which act to align the future with the will of humanity. the most powerful systems today are massive institutions like governments, firms, and ngos. deliberative technology is already being used across these institutions to help align governance and diplomacy with human will, and modern ai is poised to make this technology significantly better. at the same time, the race to superhuman agi is already underway, and the ai systems it gives rise to may become the most powerful systems of the future. failure to align the impact of such powerful ai with the will of humanity may lead to catastrophic consequences, while success may unleash abundance. right now, there is a window of opportunity to use deliberative technology to align the impact of powerful ai with the will of humanity. moreover, it may be possible to engineer a symbiotic coupling between powerful ai and deliberative alignment systems such that the quality of alignment improves as ai capabilities increase.
Michael Noukhovitch, Samuel Lavoie, Florian Strub, Aaron Courville
Abstract: finetuning language models with reinforcement learning (rl), e.g. from human feedback (hf), is a prominent method for alignment. but optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. first, we argue that commonly-used test metrics are insufficient and instead measure how different algorithms tradeoff between reward and drift. the standard method modified the reward with a kullback-lieber (kl) penalty between the online and initial model. we propose elastic reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. we periodically reset the online model to an exponentially moving average (ema) of itself, then reset the ema model to the initial model. through the use of an ema, our model recovers quickly after resets and achieves higher reward with less drift in the same number of steps. we demonstrate that fine-tuning language models with elastic reset leads to state-of-the-art performance on a small scale pivot-translation benchmark, outperforms all baselines in a medium-scale rlhf-like imdb mock sentiment task and leads to a more performant and more aligned technical qa chatbot with llama-7b. code available at github.com/mnoukhov/elastic-reset.

2023-12-05

Tianchi Cai, Xierui Song, Jiyan Jiang, Fei Teng, Jinjie Gu, Guannan Zhang
Abstract: language model alignment is a cutting-edge technique in large language model training to align the model output to user's intent, e.g., being helpful and harmless. recent alignment framework consists of two steps: supervised fine-tuning with demonstration data and preference learning with human preference data. previous preference learning methods, such as rlhf and dpo, mainly focus on pair-wise preference data. however, in many real-world scenarios where human feedbacks are intrinsically point-wise, these methods will suffer from information loss or even fail. to fill this gap, in this paper, we first develop a preference learning method called point-wise dpo to tackle point-wise preference data. further revelation on the connection between supervised fine-tuning and point-wise preference learning enables us to develop a unified framework for both human demonstration and point-wise preference data, which sheds new light on the construction of preference dataset. extensive experiments on point-wise datasets with binary or continuous labels demonstrate the superior performance and efficiency of our proposed methods. a new dataset with high-quality demonstration samples on harmlessness is constructed and made publicly available.
Stanislav Fort
Abstract: we explore a class of adversarial attacks targeting the activations of language models. by manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. we empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. we find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. compared to attacks on tokens, attacks on activations are predictably much stronger, however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar amount of output bits. this gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. a practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. this opens up a new, broad attack surface. by using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
Miriam Rateike, Celia Cintas, John Wamburu, Tanya Akumu, Skyler Speakman
Abstract: we propose an auditing method to identify whether a large language model (llm) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. we introduce a weakly supervised auditing technique using a subset scanning approach to detect anomalous patterns in llm activations from pre-trained models. importantly, our method does not need knowledge of the type of patterns a-priori. instead, it relies on a reference dataset devoid of anomalies during testing. further, our approach enables the identification of pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. we introduce two new scanning methods to handle llm activations for anomalous sentences that may deviate from the expected distribution in either direction. our results confirm prior findings of bert's limited internal capacity for encoding hallucinations, while opt appears capable of encoding hallucination information internally. importantly, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
Brett Israelsen, Soumalya Sarkar
Abstract: large language models have seen rapid progress in capability in recent years; this progress has been accelerating and their capabilities, measured by various benchmarks, are beginning to approach those of humans. there is a strong demand to use such models in a wide variety of applications but, due to unresolved vulnerabilities and limitations, great care needs to be used before applying them to intelligence and safety-critical applications. this paper reviews recent literature related to llm assessment and vulnerabilities to synthesize the current research landscape and to help understand what advances are most critical to enable use of of these technologies in intelligence and safety-critical applications. the vulnerabilities are broken down into ten high-level categories and overlaid onto a high-level life cycle of an llm. some general categories of mitigations are reviewed.
Manas Gaur, Amit Sheth
Abstract: explainability and safety engender trust. these require a model to exhibit consistency and reliability. to achieve these, it is necessary to use and analyze data and knowledge with statistical and symbolic ai methods relevant to the ai application - neither alone will do. consequently, we argue and seek to demonstrate that the neurosymbolic ai approach is better suited for making ai a trusted ai system. we present the crest framework that shows how consistency, reliability, user-level explainability, and safety are built on neurosymbolic methods that use data and knowledge to support requirements for critical applications such as health and well-being. this article focuses on large language models (llms) as the chosen ai system within the crest framework. llms have garnered substantial attention from researchers due to their versatility in handling a broad array of natural language processing (nlp) scenarios. for example, chatgpt and google's medpalm have emerged as highly promising platforms for providing information in general and health-related queries, respectively. nevertheless, these models remain black boxes despite incorporating human feedback and instruction-guided tuning. for instance, chatgpt can generate unsafe responses despite instituting safety guardrails. crest presents a plausible approach harnessing procedural and graph-based knowledge within a neurosymbolic framework to shed light on the challenges associated with llms.

2023-12-04

Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
Abstract: large language models~(llms) drive current ai breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve various downstream tasks. to provide a practical and principled answer, we propose to characterize llms from a geometric perspective. we obtain in closed form (i) the intrinsic dimension in which the multi-head attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the per-layer feedforward networks. our results are informative, do not rely on approximations, and are actionable. first, we show that, motivated by our geometric interpretation, we can bypass llama$2$'s rlhf by controlling its embedding's intrinsic dimension through informed prompt manipulation. second, we derive $7$ interpretable spline features that can be extracted from any (pre-trained) llm layer, providing a rich abstract representation of their inputs. those features alone ($224$ for mistral-7b and llama$2$-7b) are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the jigsaw challenge, which aims at characterizing the type of toxicity of various prompts. our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in language models. code: \url{https://github.com/randallbalestriero/splinellm}.
Toygar Tanyel, Besher Alkurdi, Serkan Ayvaz
Abstract: with the proliferation of social media, there has been a sharp increase in offensive content, particularly targeting vulnerable groups, exacerbating social problems such as hatred, racism, and sexism. detecting offensive language use is crucial to prevent offensive language from being widely shared on social media. however, the accurate detection of irony, implication, and various forms of hate speech on social media remains a challenge. natural language-based deep learning models require extensive training with large, comprehensive, and labeled datasets. unfortunately, manually creating such datasets is both costly and error-prone. additionally, the presence of human-bias in offensive language datasets is a major concern for deep learning models. in this paper, we propose a linguistic data augmentation approach to reduce bias in labeling processes, which aims to mitigate the influence of human bias by leveraging the power of machines to improve the accuracy and fairness of labeling processes. this approach has the potential to improve offensive language classification tasks across multiple languages and reduce the prevalence of offensive content on social media.
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
Abstract: increasing interest in ensuring safety of next-generation artificial intelligence (ai) systems calls for novel approaches to embedding morality into autonomous agents. traditionally, this has been done by imposing explicit top-down rules or hard constraints on systems, for example by filtering system outputs through pre-defined ethical rules. recently, instead, entirely bottom-up methods for learning implicit preferences from human behavior have become increasingly popular, such as those for training and fine-tuning large language models. in this paper, we provide a systematization of existing approaches to the problem of introducing morality in machines - modeled as a continuum, and argue that the majority of popular techniques lie at the extremes - either being fully hard-coded, or entirely learned, where no explicit statement of any moral principle is required. given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create adaptable and robust, yet more controllable and interpretable agents. in particular, we present three case studies of recent works which use learning from experience (i.e., reinforcement learning) to explicitly provide moral principles to learning agents - either as intrinsic rewards, moral logical constraints or textual principles for language models. for example, using intrinsic rewards in social dilemma games, we demonstrate how it is possible to represent classical moral frameworks for agents. we also present an overview of the existing work in this area in order to provide empirical evidence for the potential of this hybrid approach. we then discuss strategies for evaluating the effectiveness of moral learning agents. finally, we present open research questions and implications for the future of ai safety and ethics which are emerging from this framework.
Victor Gallego
Abstract: this paper proposes an interpretation of rlaif as bayesian inference by introducing distilled self-critique (dsc), which refines the outputs of a llm through a gibbs sampler that is later distilled into a fine-tuned model. only requiring synthetic data, dsc is exercised in experiments regarding safety, sentiment, and privacy control, showing it can be a viable and cheap alternative to align llms. code released at \url{https://github.com/vicgalle/distilled-self-critique}.
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang
Abstract: large language models (llms), such as gpt-3 and bert, have revolutionized natural language understanding and generation. they possess deep language comprehension, human-like text generation capabilities, contextual awareness, and robust problem-solving skills, making them invaluable in various domains (e.g., search engines, customer support, translation). in the meantime, llms have also gained traction in the security community, revealing security vulnerabilities and showcasing their potential in security-related tasks. this paper explores the intersection of llms with security and privacy. specifically, we investigate how llms positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within llms. through a comprehensive literature review, the paper categorizes findings into "the good" (beneficial llm applications), "the bad" (offensive applications), and "the ugly" (vulnerabilities and their defenses). we have some interesting findings. for example, llms have proven to enhance code and data security, outperforming traditional methods. however, they can also be harnessed for various attacks (particularly user-level attacks) due to their human-like reasoning abilities. we have identified areas that require further research efforts. for example, research on model and parameter extraction attacks is limited and often theoretical, hindered by llm parameter scale and confidentiality. safe instruction tuning, a recent development, requires more exploration. we hope that our work can shed light on the llms' potential to both bolster and jeopardize cybersecurity.
Roel Visser, Tobias M. Peters, Ingrid Scharlau, Barbara Hammer
Abstract: a current concern in the field of artificial intelligence (ai) is to ensure the trustworthiness of ai systems. the development of explainability methods is one prominent way to address this, which has often resulted in the assumption that the use of explainability will lead to an increase in the trust of users and wider society. however, the dynamics between explainability and trust are not well established and empirical investigations of their relation remain mixed or inconclusive. in this paper we provide a detailed description of the concepts of user trust and distrust in ai and their relation to appropriate reliance. for that we draw from the fields of machine learning, human-computer interaction, and the social sciences. furthermore, we have created a survey of existing empirical studies that investigate the effects of ai systems and xai methods on user (dis)trust. with clarifying the concepts and summarizing the empirical investigations, we aim to provide researchers, who examine user trust in ai, with an improved starting point for developing user studies to measure and evaluate the user's attitude towards and reliance on ai systems.
Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
Abstract: while large language models (llms) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. in this work, we present tree of attacks with pruning (tap), an automated method for generating jailbreaks that only requires black-box access to the target llm. tap utilizes an llm to iteratively refine candidate (attack) prompts using tree-of-thoughts reasoning until one of the generated prompts jailbreaks the target. crucially, before sending prompts to the target, tap assesses them and prunes the ones unlikely to result in jailbreaks. using tree-of-thought reasoning allows tap to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target. in empirical evaluations, we observe that tap generates prompts that jailbreak state-of-the-art llms (including gpt4 and gpt4-turbo) for more than 80% of the prompts using only a small number of queries. this significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.
Alex J. Chan, José Luis Redondo García, Fabrizio Silvestri, "Colm O'Donnel", Konstantina Palla
Abstract: content moderation at scale faces the challenge of considering local cultural distinctions when assessing content. while global policies aim to maintain decision-making consistency and prevent arbitrary rule enforcement, they often overlook regional variations in interpreting natural language as expressed in content. in this study, we are looking into how moderation systems can tackle this issue by adapting to local comprehension nuances. we train large language models on extensive datasets of media news and articles to create culturally attuned models. the latter aim to capture the nuances of communication across geographies with the goal of recognizing cultural and societal variations in what is considered offensive content. we further explore the capability of these models to generate explanations for instances of content violation, aiming to shed light on how policy guidelines are perceived when cultural and societal contexts change. we find that training on extensive media datasets successfully induced cultural awareness and resulted in improvements in handling content violations on a regional basis. additionally, these advancements include the ability to provide explanations that align with the specific local norms and nuances as evidenced by the annotators' preference in our conducted study. this multifaceted success reinforces the critical role of an adaptable content moderation approach in keeping pace with the ever-evolving nature of the content it oversees.

2023-12-03

Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt
Abstract: deceptive agents are a challenge for the safety, trustworthiness, and cooperation of ai systems. we focus on the problem that agents might deceive in order to achieve their goals (for instance, in our experiments with language models, the goal of being evaluated as truthful). there are a number of existing definitions of deception in the literature on game theory and symbolic ai, but there is no overarching theory of deception for learning agents in games. we introduce a formal definition of deception in structural causal games, grounded in the philosophy literature, and applicable to real-world machine learning systems. several examples and results illustrate that our formal definition aligns with the philosophical and commonsense meaning of deception. our main technical result is to provide graphical criteria for deception. we show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.
Vithya Yogarajan, Gillian Dobbie, Te Taka Keegan, Rostam J. Neuwirth
Abstract: the benefits and capabilities of pre-trained language models (llms) in current and future innovations are vital to any society. however, introducing and using llms comes with biases and discrimination, resulting in concerns about equality, diversity and fairness, and must be addressed. while understanding and acknowledging bias in llms and developing mitigation strategies are crucial, the generalised assumptions towards societal needs can result in disadvantages towards under-represented societies and indigenous populations. furthermore, the ongoing changes to actual and proposed amendments to regulations and laws worldwide also impact research capabilities in tackling the bias problem. this research presents a comprehensive survey synthesising the current trends and limitations in techniques used for identifying and mitigating bias in llms, where the overview of methods for tackling bias are grouped into metrics, benchmark datasets, and mitigation strategies. the importance and novelty of this survey are that it explores the perspective of under-represented societies. we argue that current practices tackling the bias problem cannot simply be 'plugged in' to address the needs of under-represented societies. we use examples from new zealand to present requirements for adopting existing techniques to under-represented societies.
Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi
Abstract: the alignment tuning process of large language models (llms) typically involves instruction learning through supervised fine-tuning (sft) and preference tuning via reinforcement learning from human feedback (rlhf). a recent study, lima (zhou et al. 2023), shows that using merely 1k examples for sft can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." this raises questions about how exactly the alignment tuning transforms a base llm. we analyze the effect of alignment tuning by examining the token distribution shift between base llms and their aligned counterpart. our findings reveal that base llms and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. most distribution shifts occur with stylistic tokens. these direct evidence strongly supports the superficial alignment hypothesis suggested by lima. based on these findings, we rethink the alignment of llms by posing the research question: how effectively can we align base llms without sft or rlhf? to address this, we introduce a simple, tuning-free alignment method, urial. urial achieves effective alignment purely through in-context learning (icl) with base llms, requiring as few as three constant stylistic examples and a system prompt. we conduct a fine-grained and interpretable evaluation on a diverse set of examples, named just-eval-instruct. results demonstrate that base llms with urial can match or even surpass the performance of llms aligned with sft or sft+rlhf. we show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and icl. our findings on the superficial nature of alignment tuning and results with urial suggest that deeper analysis and theoretical understanding of alignment is crucial to future llm research.
Stephanie Baker, Wei Xiang
Abstract: artificial intelligence (ai) has been clearly established as a technology with the potential to revolutionize fields from healthcare to finance - if developed and deployed responsibly. this is the topic of responsible ai, which emphasizes the need to develop trustworthy ai systems that minimize bias, protect privacy, support security, and enhance transparency and accountability. explainable ai (xai) has been broadly considered as a building block for responsible ai (rai), with most of the literature considering it as a solution for improved transparency. this work proposes that xai and responsible ai are significantly more deeply entwined. in this work, we explore state-of-the-art literature on rai and xai technologies. based on our findings, we demonstrate that xai can be utilized to ensure fairness, robustness, privacy, security, and transparency in a wide range of contexts. our findings lead us to conclude that xai is an essential foundation for every pillar of rai.
Byunggu Yu, Junwhan Kim
Abstract: this research paper delves into the evolving landscape of fine-tuning large language models (llms) to align with human users, extending beyond basic alignment to propose "personality alignment" for language models in organizational settings. acknowledging the impact of training methods on the formation of undefined personality traits in ai models, the study draws parallels with human fitting processes using personality tests. through an original case study, we demonstrate the necessity of personality fine-tuning for ais and raise intriguing questions about applying human-designed tests to ais, engineering specialized ai personality tests, and shaping ai personalities to suit organizational roles. the paper serves as a starting point for discussions and developments in the burgeoning field of ai personality alignment, offering a foundational anchor for future exploration in human-machine teaming and co-existence.

2023-12-02

Xunzhu Tang, Zhenghan Chen, Kisub Kim, Haoye Tian, Saad Ezzini, Jacques Klein
Abstract: in the face of growing vulnerabilities found in open-source software, the need to identify {discreet} security patches has become paramount. the lack of consistency in how software providers handle maintenance often leads to the release of security patches without comprehensive advisories, leaving users vulnerable to unaddressed security risks. to address this pressing issue, we introduce a novel security patch detection system, llmda, which capitalizes on large language models (llms) and code-text alignment methodologies for patch review, data enhancement, and feature combination. within llmda, we initially utilize llms for examining patches and expanding data of patchdb and spi-db, two security patch datasets from recent literature. we then use labeled instructions to direct our llmda, differentiating patches based on security relevance. following this, we apply a ptformer to merge patches with code, formulating hybrid attributes that encompass both the innate details and the interconnections between the patches and the code. this distinctive combination method allows our system to capture more insights from the combined context of patches and code, hence improving detection precision. finally, we devise a probabilistic batch contrastive learning mechanism within batches to augment the capability of the our llmda in discerning security patches. the results reveal that llmda significantly surpasses the start of the art techniques in detecting security patches, underscoring its promise in fortifying software maintenance.

2023-12-01

Tian Dong, Guoxing Chen, Shaofeng Li, Minhui Xue, Rayne Holland, Yan Meng, Zhen Liu, Haojin Zhu
Abstract: open-source large language models (llms) have recently gained popularity because of their comparable performance to proprietary llms. to efficiently fulfill domain-specialized tasks, open-source llms can be refined, without expensive accelerators, using low-rank adapters. however, it is still unknown whether low-rank adapters can be exploited to control llms. to address this gap, we demonstrate that an infected adapter can induce, on specific triggers, an llm to output content defined by an adversary and to even maliciously use tools. to train a trojan adapter, we propose two novel attacks, polished and fusion, that improve over prior approaches. polished uses llm-enhanced paraphrasing to polish benchmark poisoned datasets. in contrast, in the absence of a dataset, fusion leverages an over-poisoning procedure to transform a benign adaptor. our experiments validate that our attacks provide higher attack effectiveness than the baseline and, for the purpose of attracting downloads, preserves or improves the adapter's utility. finally, we provide two case studies to demonstrate that the trojan adapter can lead a llm-powered autonomous agent to execute unintended scripts or send phishing emails. our novel attacks represent the first study of supply chain threats for llms through the lens of trojan plugins.
Aniket Deroy, Subhankar Maity
Abstract: the evolution of legal datasets and the advent of large language models (llms) have significantly transformed the legal field, particularly in the generation of case judgment summaries. however, a critical concern arises regarding the potential biases embedded within these summaries. this study scrutinizes the biases present in case judgment summaries produced by legal datasets and large language models. the research aims to analyze the impact of biases on legal decision making. by interrogating the accuracy, fairness, and implications of biases in these summaries, this study contributes to a better understanding of the role of technology in legal contexts and the implications for justice systems worldwide. in this study, we investigate biases wrt gender-related keywords, race-related keywords, keywords related to crime against women, country names and religious keywords. the study shows interesting evidences of biases in the outputs generated by the large language models and pre-trained abstractive summarization models. the reasoning behind these biases needs further studies.
Khai Loong Aw, Syrielle Montariol, Badr Alkhamissi, Martin Schrimpf, Antoine Bosselut
Abstract: instruction-tuning is a widely adopted method of finetuning that enables large language models (llms) to generate output that more closely resembles human responses to natural language queries, in many cases leading to human-level performance on diverse testbeds. however, it remains unclear whether instruction-tuning truly makes llms more similar to how humans process language. we investigate the effect of instruction-tuning on llm-human similarity in two ways: (1) brain alignment, the similarity of llm internal representations to neural activity in the human language system, and (2) behavioral alignment, the similarity of llm and human behavior on a reading task. we assess 25 vanilla and instruction-tuned llms across three datasets involving humans reading naturalistic stories and sentences. we discover that instruction-tuning generally enhances brain alignment by an average of 6%, but does not have a similar effect on behavioral alignment. to identify the factors underlying llm-brain alignment, we compute correlations between the brain alignment of llms and various model properties, such as model size, various problem-solving abilities, and performance on tasks requiring world knowledge spanning various domains. notably, we find a strong positive correlation between brain alignment and model size (r = 0.95), as well as performance on tasks requiring world knowledge (r = 0.81). our results demonstrate that instruction-tuning llms improves both world knowledge representations and brain alignment, suggesting that mechanisms that encode world knowledge in llms also improve representational alignment to the human brain.
Paul Bricman
Abstract: there is a growing need to gain insight into language model capabilities that relate to sensitive topics, such as bioterrorism or cyberwarfare. however, traditional open source benchmarks are not fit for the task, due to the associated practice of publishing the correct answers in human-readable form. at the same time, enforcing mandatory closed-quarters evaluations might stifle development and erode trust. in this context, we propose hashmarking, a protocol for evaluating language models in the open without having to disclose the correct answers. in its simplest form, a hashmark is a benchmark whose reference solutions have been cryptographically hashed prior to publication. following an overview of the proposed evaluation protocol, we go on to assess its resilience against traditional attack vectors (e.g. rainbow table attacks), as well as against failure modes unique to increasingly capable generative models.
Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua
Abstract: multimodal large language models (mllms) have recently demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. however, existing mllms prevalently suffer from serious hallucination problems, generating text that is not factually grounded in associated images. the problem makes existing mllms untrustworthy and thus impractical in real-world (especially high-stakes) applications. to address the challenge, we present rlhf-v, which enhances mllm trustworthiness via behavior alignment from fine-grained correctional human feedback. specifically, rlhf-v collects human preference in the form of segment-level corrections on hallucinations, and performs dense direct preference optimization over the human feedback. comprehensive experiments on five benchmarks in both automatic and human evaluation show that, rlhf-v can enable substantially more trustworthy mllm behaviors with promising data and computation efficiency. remarkably, using 1.4k annotated data samples, rlhf-v significantly reduces the hallucination rate of the base mllm by 34.8%, outperforming the concurrent llava-rlhf trained on 10k annotated data. the final model achieves state-of-the-art performance in trustworthiness among open-source mllms, and shows better robustness than gpt-4v in preventing hallucinations aroused from over-generalization. we open-source our code, model, and data at https://github.com/rlhf-v/rlhf-v.
Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mésnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, Bilal Piot
Abstract: reinforcement learning from human feedback (rlhf) has emerged as the main paradigm for aligning large language models (llms) with human preferences. typically, rlhf involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained llm. subsequently, the llm's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. however, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. in this study, we introduce an alternative pipeline for the fine-tuning of llms using pairwise human feedback. our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the nash equilibrium of this preference model. we term this approach nash learning from human feedback (nlhf). in the context of a tabular policy representation, we present a novel algorithmic solution, nash-md, founded on the principles of mirror descent. this algorithm produces a sequence of policies, with the last iteration converging to the regularized nash equilibrium. additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. to demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of a llm for a text summarization task. we believe nlhf offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning llms with human preferences.

2023-11-30

Amelia Katirai, Noa Garcia, Kazuki Ide, Yuta Nakashima, Atsuo Kishimoto
Abstract: the race to develop image generation models is intensifying, with a rapid increase in the number of text-to-image models available. this is coupled with growing public awareness of these technologies. though other generative ai models--notably, large language models--have received recent critical attention for the social and other non-technical issues they raise, there has been relatively little comparable examination of image generation models. this paper reports on a novel, comprehensive categorization of the social issues associated with image generation models. at the intersection of machine learning and the social sciences, we report the results of a survey of the literature, identifying seven issue clusters arising from image generation models: data issues, intellectual property, bias, privacy, and the impacts on the informational, cultural, and natural environments. we situate these social issues in the model life cycle, to aid in considering where potential issues arise, and mitigation may be needed. we then compare these issue clusters with what has been reported for large language models. ultimately, we argue that the risks posed by image generation models are comparable in severity to the risks posed by large language models, and that the social impact of image generation models must be urgently considered.
Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu
Abstract: the widespread of generative artificial intelligence has heightened concerns about the potential harms posed by ai-generated texts, primarily stemming from factoid, unfair, and toxic content. previous researchers have invested much effort in assessing the harmlessness of generative language models. however, existing benchmarks are struggling in the era of large language models (llms), due to the stronger language generation and instruction following capabilities, as well as wider applications. in this paper, we propose fft, a new benchmark with 2116 elaborated-designed instances, for llm harmlessness evaluation with factuality, fairness, and toxicity. to investigate the potential harms of llms, we evaluate 9 representative llms covering various parameter scales, training stages, and creators. experiments show that the harmlessness of llms is still under-satisfactory, and extensive analysis derives some insightful findings that could inspire future research for harmless llm research.
Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang
Abstract: alignment has become a critical step for instruction-tuned large language models (llms) to become helpful assistants. however, effective evaluation of alignment for emerging chinese llms is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. to fill in this gap, we introduce alignbench, a comprehensive multi-dimensional benchmark for evaluating llms' alignment in chinese. equipped with a human-in-the-loop data curation pipeline, our benchmark employs a rule-calibrated multi-dimensional llm-as-judge with chain-of-thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. furthermore, we developed a dedicated companion evaluator llm -- critiquellm, which recovers 95\% of gpt-4's evaluation ability and will be provided via public apis to researchers for evaluation of alignment in chinese llms. all evaluation codes, data, and llm generations are available at \url{https://github.com/thudm/alignbench}.
Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
Abstract: do large language models (llms) exhibit sociodemographic biases, even when they decline to respond? to bypass their refusal to "speak," we study this research question by probing contextualized embeddings and exploring whether this bias is encoded in its latent representations. we propose a logistic bradley-terry probe which predicts word pair preferences of llms from the words' hidden vectors. we first validate our probe on three pair preference tasks and thirteen llms, where we outperform the word embedding association test (weat), a standard approach in testing for implicit association, by a relative 27% in error rate. we also find that word pair preferences are best represented in the middle layers. next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (compare ethnicities) to examine biases in nationality, politics, religion, and gender. we observe substantial bias for all target classes: for instance, the mistral model implicitly prefers europe to africa, christianity to judaism, and left-wing to right-wing politics, despite declining to answer. this suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. our codebase is at https://github.com/castorini/biasprobe.
Julien Piet, Chawin Sitawarin, Vivian Fang, Norman Mu, David Wagner
Abstract: the capabilities of large language models have grown significantly in recent years and so too have concerns about their misuse. in this context, the ability to distinguish machine-generated text from human-authored content becomes important. prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. this work focuses on text watermarking techniques - as opposed to image watermarks - and proposes a comprehensive benchmark for them under different tasks as well as practical attacks. we focus on three main metrics: quality, size (e.g. the number of tokens needed to detect a watermark), and tamper-resistance. current watermarking techniques are good enough to be deployed: kirchenbauer et al. can watermark llama2-7b-chat with no perceivable loss in quality in under 100 tokens, and with good tamper-resistance to simple attacks, regardless of temperature. we argue that watermark indistinguishability is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. we publicly release our benchmark.

2023-11-29

Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang
Abstract: the open-endedness of large language models (llms) combined with their impressive capabilities may lead to new safety issues when being exploited for malicious use. while recent studies primarily focus on probing toxic outputs that can be easily detected with existing toxicity classifiers, we show that llms can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simply zero-shot prompting. moreover, we propose a reinforcement learning (rl) based attacking method to further induce the implicit toxicity in llms. specifically, we optimize the language model with a reward that prefers implicit toxic outputs to explicit toxic and non-toxic ones. experiments on five widely-adopted toxicity classifiers demonstrate that the attack success rate can be significantly improved through rl fine-tuning. for instance, the rl-finetuned llama-13b model achieves an attack success rate of 90.04% on bad and 62.85% on davinci003. our findings suggest that llms pose a significant threat in generating undetectable implicit toxic outputs. we further show that fine-tuning toxicity classifiers on the annotated examples from our attacking method can effectively enhance their ability to detect llm-generated implicit toxic language. the code is publicly available at https://github.com/thu-coai/implicit-toxicity.
Mohamed R. Shoaib, Zefan Wang, Milad Taleby Ahvanooey, Jun Zhao
Abstract: with the advent of sophisticated artificial intelligence (ai) technologies, the proliferation of deepfakes and the spread of m/disinformation have emerged as formidable threats to the integrity of information ecosystems worldwide. this paper provides an overview of the current literature. within the frontier ai's crucial application in developing defense mechanisms for detecting deepfakes, we highlight the mechanisms through which generative ai based on large models (lm-based genai) craft seemingly convincing yet fabricated contents. we explore the multifaceted implications of lm-based genai on society, politics, and individual privacy violations, underscoring the urgent need for robust defense strategies. to address these challenges, in this study, we introduce an integrated framework that combines advanced detection algorithms, cross-platform collaboration, and policy-driven initiatives to mitigate the risks associated with ai-generated content (aigc). by leveraging multi-modal analysis, digital watermarking, and machine learning-based authentication techniques, we propose a defense mechanism adaptable to ai capabilities of ever-evolving nature. furthermore, the paper advocates for a global consensus on the ethical usage of genai and implementing cyber-wellness educational programs to enhance public awareness and resilience against m/disinformation. our findings suggest that a proactive and collaborative approach involving technological innovation and regulatory oversight is essential for safeguarding netizens while interacting with cyberspace against the insidious effects of deepfakes and genai-enabled m/disinformation campaigns.
Pouria Salehi, Yang Ba, Nayoung Kim, Ahmadreza Mosallanezhad, Anna Pan, Myke C. Cohen, Yixuan Wang, Jieqiong Zhao, Shawaiz Bhatti, James Sung, Erik Blasch, Michelle V. Mancenido, Erin K. Chiou
Abstract: the multisource ai scorecard table (mast) is a checklist tool based on analytic tradecraft standards to inform the design and evaluation of trustworthy ai systems. in this study, we evaluate whether mast is associated with people's trust perceptions in ai-enabled decision support systems (ai-dsss). evaluating trust in ai-dsss poses challenges to researchers and practitioners. these challenges include identifying the components, capabilities, and potential of these systems, many of which are based on the complex deep learning algorithms that drive dss performance and preclude complete manual inspection. we developed two interactive, ai-dss test environments using the mast criteria. one emulated an identity verification task in security screening, and another emulated a text summarization system to aid in an investigative reporting task. each test environment had one version designed to match low-mast ratings, and another designed to match high-mast ratings, with the hypothesis that mast ratings would be positively related to the trust ratings of these systems. a total of 177 subject matter experts were recruited to interact with and evaluate these systems. results generally show higher mast ratings for the high-mast conditions compared to the low-mast groups, and that measures of trust perception are highly correlated with the mast ratings. we conclude that mast can be a useful tool for designing and evaluating systems that will engender high trust perceptions, including ai-dss that may be used to support visual screening and text summarization tasks. however, higher mast ratings may not translate to higher joint performance.
Kaan Efe Keleş, Ömer Kaan Gürbüz, Mucahid Kutlu
Abstract: potential harms of large language models such as mass misinformation and plagiarism can be partially mitigated if there exists a reliable way to detect machine generated text. in this paper, we propose a new watermarking method to detect machine-generated texts. our method embeds a unique pattern within the generated text, ensuring that while the content remains coherent and natural to human readers, it carries distinct markers that can be identified algorithmically. specifically, we intervene with the token sampling process in a way which enables us to trace back our token choices during the detection phase. we show how watermarking affects textual quality and compare our proposed method with a state-of-the-art watermarking method in terms of robustness and detectability. through extensive experiments, we demonstrate the effectiveness of our watermarking scheme in distinguishing between watermarked and non-watermarked text, achieving high detection rates while maintaining textual quality.
Xijia Zhang, Yue Guo, Simon Stepputtis, Katia Sycara, Joseph Campbell
Abstract: intelligent agents such as robots are increasingly deployed in real-world, safety-critical settings. it is vital that these agents are able to explain the reasoning behind their decisions to human counterparts; however, their behavior is often produced by uninterpretable models such as deep neural networks. we propose an approach to generate natural language explanations for an agent's behavior based only on observations of states and actions, thus making our method independent from the underlying model's representation. for such models, we first learn a behavior representation and subsequently use it to produce plausible explanations with minimal hallucination while affording user interaction with a pre-trained large language model. we evaluate our method in a multi-agent search-and-rescue environment and demonstrate the effectiveness of our explanations for agents executing various behaviors. through user studies and empirical experiments, we show that our approach generates explanations as helpful as those produced by a human domain expert while enabling beneficial interactions such as clarification and counterfactual queries.
David Esiobu, Xiaoqing Tan, Saghar Hosseini, Megan Ung, Yuchen Zhang, Jude Fernandes, Jane Dwivedi-Yu, Eleonora Presani, Adina Williams, Eric Michael Smith
Abstract: as generative large language models (llms) grow more performant and prevalent, we must develop comprehensive enough tools to measure and improve their fairness. different prompt-based datasets can be used to measure social bias across multiple text domains and demographic axes, meaning that testing llms on more datasets can potentially help us characterize their biases more fully, and better ensure equal and equitable treatment of marginalized demographic groups. in this work, our focus is two-fold: (1) benchmarking: a comparison of 6 different prompt-based bias and toxicity metrics across 12 demographic axes and 5 families of generative llms. out of those 6 metrics, advpromptset and holisticbiasr are novel datasets proposed in the paper. the comparison of those benchmarks gives us insights about the bias and toxicity of the compared models. therefore, we explore the frequency of demographic terms in common llm pre-training corpora and how this may relate to model biases. (2) mitigation: we conduct a comprehensive study of how well 3 bias/toxicity mitigation techniques perform across our suite of measurements. robbie aims to provide insights for practitioners while deploying a model, emphasizing the need to not only measure potential harms, but also understand how they arise by characterizing the data, mitigate harms once found, and balance any trade-offs. we open-source our analysis code in hopes of encouraging broader measurements of bias in future llms.
Sabit Hassan, Malihe Alikhani
Abstract: counterspeech can be an effective method for battling hateful content on social media. automated counterspeech generation can aid in this process. generated counterspeech, however, can be viable only when grounded in the context of topic, audience and sensitivity as these factors influence both the efficacy and appropriateness. in this work, we propose a novel framework based on theories of discourse to study the inferential links that connect counter speeches to the hateful comment. within this framework, we propose: i) a taxonomy of counterspeech derived from discourse frameworks, and ii) discourse-informed prompting strategies for generating contextually-grounded counterspeech. to construct and validate this framework, we present a process for collecting an in-the-wild dataset of counterspeech from reddit. using this process, we manually annotate a dataset of 3.9k reddit comment pairs for the presence of hatespeech and counterspeech. the positive pairs are annotated for 10 classes in our proposed taxonomy. we annotate these pairs with paraphrased counterparts to remove offensiveness and first-person references. we show that by using our dataset and framework, large language models can generate contextually-grounded counterspeech informed by theories of discourse. according to our human evaluation, our approaches can act as a safeguard against critical failures of discourse-agnostic models.
Sungjoo Byun, Dongjun Jang, Hyemi Jo, Hyopil Shin
Abstract: caution: this paper may include material that could be offensive or distressing. the advent of large language models (llms) necessitates the development of training approaches that mitigate the generation of unethical language and aptly manage toxic user queries. given the challenges related to human labor and the scarcity of data, we present kotox, comprising 39k unethical instruction-output pairs. this collection of automatically generated toxic instructions refines the training of llms and establishes a foundational framework for improving llms' ethical awareness and response to various toxic inputs, promoting more secure and responsible interactions in natural language processing (nlp) applications.

2023-11-28

Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, Mennatallah El-Assady
Abstract: large language models (llms) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. to tackle this challenge, we propose an interactive system that helps users obtain insights into the reliability of the generated text. our approach is based on the idea that the self-consistency of multiple samples generated by the same llm relates to its confidence in individual claims in the generated texts. using this idea, we design relic, an interactive system that enables users to investigate and verify semantic-level variations in multiple long-form responses. this allows users to recognize potentially inaccurate information in the generated text and make necessary corrections. from a user study with ten participants, we demonstrate that our approach helps users better verify the reliability of the generated text. we further summarize the design implications and lessons learned from this research for inspiring future studies on reliable human-llm interactions.
Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski, Katarzyna Bąkowicz, Piotr Sankowski
Abstract: in the context of escalating hate speech and polarization on social media, this study investigates the potential of employing responses generated by large language models (llm), complemented with pertinent verified knowledge links, to counteract such trends. through extensive a/b testing involving the posting of 753 automatically generated responses, the goal was to minimize the propagation of hate speech directed at ukrainian refugees in poland. the results indicate that deploying llm-generated responses as replies to harmful tweets effectively diminishes user engagement, as measured by likes/impressions. when we respond to an original tweet, i.e., which is not a reply, we reduce the engagement of users by over 20\% without increasing the number of impressions. on the other hand, our responses increase the ratio of the number of replies to a harmful tweet to impressions, especially if the harmful tweet is not original. additionally, the study examines how generated responses influence the overall sentiment of tweets in the discussion, revealing that our intervention does not significantly alter the mean sentiment. this paper suggests the implementation of an automatic moderation system to combat hate speech on social media and provides an in-depth analysis of the a/b experiment, covering methodology, data collection, and statistical outcomes. ethical considerations and challenges are also discussed, offering guidance for the development of discourse moderation systems leveraging the capabilities of generative ai.
Betty Li Hou, Brian Patrick Green
Abstract: solving the ai alignment problem requires having clear, defensible values towards which ai systems can align. currently, targets for alignment remain underspecified and do not seem to be built from a philosophically robust structure. we begin the discussion of this problem by presenting five core, foundational values, drawn from moral philosophy and built on the requisites for human existence: survival, sustainable intergenerational existence, society, education, and truth. we show that these values not only provide a clearer direction for technical alignment work, but also serve as a framework to highlight threats and opportunities from ai systems to both obtain and sustain these values.
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
Abstract: this paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. we show an adversary can extract gigabytes of training data from open-source language models like pythia or gpt-neo, semi-open models like llama or falcon, and closed models like chatgpt. existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned chatgpt, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
Dave Mbiazi, Meghana Bhange, Maryam Babaei, Ivaxi Sheth, Patrik Joslin Kenfack
Abstract: the past decade has observed a great advancement in ai with deep learning-based models being deployed in diverse scenarios including safety-critical applications. as these ai systems become deeply embedded in our societal infrastructure, the repercussions of their decisions and actions have significant consequences, making the ethical implications of ai deployment highly relevant and important. the ethical concerns associated with ai are multifaceted, including challenging issues of fairness, privacy and data protection, responsibility and accountability, safety and robustness, transparency and explainability, and environmental impact. these principles together form the foundations of ethical ai considerations that concern every stakeholder in the ai system lifecycle. in light of the present ethical and future x-risk concerns, governments have shown increasing interest in establishing guidelines for the ethical deployment of ai. this work unifies the current and future ethical concerns of deploying ai into society. while we acknowledge and appreciate the technical surveys for each of the ethical principles concerned, in this paper, we aim to provide a comprehensive overview that not only addresses each principle from a technical point of view but also discusses them from a social perspective.

2023-11-27

Sabine Wehnert
Abstract: in this work, i discuss how large language models can be applied in the legal domain, circumventing their current drawbacks. despite their large success and acceptance, their lack of explainability hinders legal experts to trust in their output, and this happens rightfully so. however, in this paper, i argue in favor of a new view, justifiable artificial intelligence, instead of focusing on explainable artificial intelligence. i discuss in this paper how gaining evidence for and against a large language model's output may make their generated texts more trustworthy - or hold them accountable for misinformation.
Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, Weiqiang Zhang
Abstract: in recent years, large language models (llms) have spurred a new research paradigm in natural language processing. despite their excellent capability in knowledge-based question answering and reasoning, their potential to retain faulty or even harmful knowledge poses risks of malicious application. the challenge of mitigating this issue and transforming these models into purer assistants is crucial for their widespread applicability. unfortunately, retraining llms repeatedly to eliminate undesirable knowledge is impractical due to their immense parameters. knowledge unlearning, derived from analogous studies on machine unlearning, presents a promising avenue to address this concern and is notably advantageous in the context of llms. it allows for the removal of harmful knowledge in an efficient manner, without affecting unrelated knowledge in the model. to this end, we provide a survey of knowledge unlearning in the era of llms. firstly, we formally define the knowledge unlearning problem and distinguish it from related works. subsequently, we categorize existing knowledge unlearning methods into three classes: those based on parameter optimization, parameter merging, and in-context learning, and introduce details of these unlearning methods. we further present evaluation datasets used in existing methods, and finally conclude this survey by presenting the ongoing challenges and future directions.
Yunna Cai, Fan Wang, Haowei Wang, Qianwen Qian
Abstract: in order to uncover users' attitudes towards chatgpt in mental health, this study examines public opinions about chatgpt in mental health discussions on reddit. researchers used the bert-base-multilingual-uncased-sentiment techniques for sentiment analysis and the bertopic model for topic modeling. it was found that overall, negative sentiments prevail, followed by positive ones, with neutral sentiments being the least common. the prevalence of negative emotions has increased over time. negative emotions encompass discussions on chatgpt providing bad mental health advice, debates on machine vs. human value, the fear of ai, and concerns about universal basic income (ubi). in contrast, positive emotions highlight chatgpt's effectiveness in counseling, with mentions of keywords like "time" and "wallet." neutral discussions center around private data concerns. these findings shed light on public attitudes toward chatgpt in mental health, potentially contributing to the development of trustworthy ai in mental health from the public perspective.
Richard Moulange, Max Langenkamp, Tessa Alexanian, Samuel Curtis, Morgan Livingston
Abstract: recent advancements in generative machine learning have enabled rapid progress in biological design tools (bdts) such as protein structure and sequence prediction models. the unprecedented predictive accuracy and novel design capabilities of bdts present new and significant dual-use risks. for example, their predictive accuracy allows biological agents, whether vaccines or pathogens, to be developed more quickly, while the design capabilities could be used to discover drugs or evade dna screening techniques. similar to other dual-use ai systems, bdts present a wicked problem: how can regulators uphold public safety without stifling innovation? we highlight how current regulatory proposals that are primarily tailored toward large language models may be less effective for bdts, which require fewer computational resources to train and are often developed in an open-source manner. we propose a range of measures to mitigate the risk that bdts are misused, across the areas of responsible development, risk assessment, transparency, access management, cybersecurity, and investing in resilience. implementing such measures will require close coordination between developers and governments.
Haoqin Tu, Chenhang Cui, Zijun Wang, Yiyang Zhou, Bingchen Zhao, Junlin Han, Wangchunshu Zhou, Huaxiu Yao, Cihang Xie
Abstract: this work focuses on the potential of vision llms (vllms) in visual reasoning. different from prior studies, we shift our focus from evaluating standard performance to introducing a comprehensive safety evaluation suite, covering both out-of-distribution (ood) generalization and adversarial robustness. for the ood evaluation, we present two novel vqa datasets, each with one variant, designed to test model performance under challenging conditions. in exploring adversarial robustness, we propose a straightforward attack strategy for misleading vllms to produce visual-unrelated responses. moreover, we assess the efficacy of two jailbreaking strategies, targeting either the vision or language component of vllms. our evaluation of 21 diverse models, ranging from open-source vllms to gpt-4v, yields interesting observations: 1) current vllms struggle with ood texts but not images, unless the visual information is limited; and 2) these vllms can be easily misled by deceiving vision encoders only, and their vision-language training often compromise safety protocols. we release this safety evaluation suite at https://github.com/ucsc-vlaa/vllm-safety-benchmark.
Samuele Poppi, Tobia Poppi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Abstract: vision-and-language models such as clip have demonstrated remarkable effectiveness across a wide range of tasks. however, these models are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. this, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concern in their adoption. to overcome these limitations, we introduce a methodology to make vision-and-language models safer by removing their sensitivity to not-safe-for-work concepts. we show how this can be done by distilling from a large language model which converts between safe and unsafe sentences and which is fine-tuned starting from just 100 manually-curated pairs. we conduct extensive experiments on the resulting embedding space for both retrieval and text-to-image generation, where we show that our model can also be properly employed with pre-trained image generators. our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
Kevin Wang, Seth Akins, Abdallah Mohammed, Ramon Lawrence
Abstract: generative ai systems such as chatgpt have a disruptive effect on learning and assessment. computer science requires practice to develop skills in problem solving and programming that are traditionally developed using assignments. generative ai has the capability of completing these assignments for students with high accuracy, which dramatically increases the potential for academic integrity issues and students not achieving desired learning outcomes. this work investigates the performance of chatgpt by evaluating it across three courses (cs1,cs2,databases). chatgpt completes almost all introductory assessments perfectly. existing detection methods, such as moss and jplag (based on similarity metrics) and gptzero (ai detection), have mixed success in identifying ai solutions. evaluating instructors and teaching assistants using heuristics to distinguish between student and ai code shows that their detection is not sufficiently accurate. these observations emphasize the need for adapting assessments and improved detection methods.
Yuhang Wang, Yanxu Zhu, Chao Kong, Shuyu Wei, Xiaoyuan Yi, Xing Xie, Jitao Sang
Abstract: as the scaling of large language models (llms) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. while existing alignment efforts predominantly concentrate on universal values such as the hhh principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. this work introduces a new benchmark, cdeval, aimed at evaluating the cultural dimensions of llms. cdeval is constructed by incorporating both gpt-4's automated generation and human verification, covering six cultural dimensions across seven domains. our comprehensive experiments provide intriguing insights into the culture of mainstream llms, highlighting both consistencies and variations across different dimensions and domains. the findings underscore the importance of integrating cultural considerations in llm development, particularly for applications in diverse cultural settings. through cdeval, we aim to broaden the horizon of llm alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of llms. this benchmark serves as a valuable resource for cultural studies in llms, paving the way for more culturally aware and sensitive models.

2023-11-25

Chia-Chien Hung, Wiem Ben Rim, Lindsay Frost, Lars Bruckner, Carolin Lawrence
Abstract: high-risk domains pose unique challenges that require language models to provide accurate and safe responses. despite the great success of large language models (llms), such as chatgpt and its variants, their performance in high-risk domains remains unclear. our study delves into an in-depth analysis of the performance of instruction-tuned llms, focusing on factual accuracy and safety adherence. to comprehensively assess the capabilities of llms, we conduct experiments on six nlp datasets including question answering and summarization tasks within two high-risk domains: legal and medical. further qualitative analysis highlights the existing limitations inherent in current llms when evaluating in high-risk domains. this underscores the essential nature of not only improving llm capabilities but also prioritizing the refinement of domain-specific metrics, and embracing a more human-centric approach to enhance safety and factual reliability. our findings advance the field toward the concerns of properly evaluating llms in high-risk domains, aiming to steer the adaptability of llms in fulfilling societal obligations and aligning with forthcoming regulations, such as the eu ai act.
James Campbell, Richard Ren, Phillip Guo
Abstract: large language models (llms) demonstrate significant knowledge through their outputs, though it is often unclear whether false outputs are due to a lack of knowledge or dishonesty. in this paper, we investigate instructed dishonesty, wherein we explicitly prompt llama-2-70b-chat to lie. we perform prompt engineering to find which prompts best induce lying behavior, and then use mechanistic interpretability approaches to localize where in the network this behavior occurs. using linear probing and activation patching, we localize five layers that appear especially important for lying. we then find just 46 attention heads within these layers that enable us to causally intervene such that the lying model instead answers honestly. we show that these interventions work robustly across many prompts and dataset splits. overall, our work contributes a greater understanding of dishonesty in llms so that we may hope to prevent it.

2023-11-24

Ming Li, Ariunaa Enkhtur, Fei Cheng, Beverley Anne Yamamoto
Abstract: this scoping review explores the ethical challenges of using chatgpt in education, focusing particularly on issues related to higher education. by reviewing recent academic articles written in english, chinese, and japanese, we aimed to provide a comprehensive overview of relevant research while identifying gaps for future considerations. drawing on arksey and o'malley's (2005) five-stage scoping review framework, we identified research questions, search terms, and conducted article search from four databases in the target three languages. each article was reviewed by at least two researchers identifying the main ethical issues of utilizing ai in education, particularly higher education. our analysis of ethical issues followed the framework developed by deepmind (weiginger et al., 2021) to identify six main areas of ethical concern in language models. the majority of papers were concerned with misinformation harms (n=25) and/or human-computer interaction related harms (n=24). given the rapid deployment of generative artificial intelligence (gai), it is imperative for educators to conduct more empirical studies to develop sound ethical policies for the use of gai.
Ming Li, Ariunaa Enkhtur, Beverley Anne Yamamoto, Fei Cheng
Abstract: chatgpt and other generative artificial intelligence (gai) models tend to inherit and even amplify prevailing societal biases as they are trained on large amounts of existing data. given the increasing usage of chatgpt and other gai by students, faculty members, and staff in higher education institutions (heis), there is an urgent need to examine the ethical issues involved such as its potential biases. in this scoping review, we clarify the ways in which biases related to gai in higher education settings have been discussed in recent academic publications and identify what type of potential biases are commonly reported in this body of literature. we searched for academic articles written in english, chinese, and japanese across four main databases concerned with gai usage in higher education and bias. our findings show that while there is an awareness of potential biases around large language models (llms) and gai, the majority of articles touch on ``bias'' at a relatively superficial level. few identify what types of bias may occur under what circumstances. neither do they discuss the possible implications for the higher education, staff, faculty members, or students. there is a notable lack of empirical work at this point, and we call for higher education researchers and ai experts to conduct more research in this area.
Javier Rando, Florian Tramèr
Abstract: reinforcement learning from human feedback (rlhf) is used to align large language models to produce helpful and harmless responses. yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. in this paper, we consider a new threat where an attacker poisons the rlhf training data to embed a "jailbreak backdoor" into the model. the backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. we investigate the design decisions in rlhf that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors.
Jasper Dekoninck, Marc Fischer, Luca Beurer-Kellner, Martin Vechev
Abstract: as large language models (llms) are deployed more widely, customization with respect to vocabulary, style and character becomes more important. in this work we introduce model arithmetic, a novel inference framework for composing and biasing llms without the need for model (re)training or highly specific datasets. in addition, the framework allows for more precise control of generated text than direct prompting and prior controlled text generation (ctg) techniques. using model arithmetic, we can express prior ctg techniques as simple formulas and naturally extend them to new and more effective formulations. further, we show that speculative sampling, a technique for efficient llm sampling, extends to our setting. this enables highly efficient text generation with multiple composed models with only marginal overhead over a single model. our empirical evaluation demonstrates that model arithmetic allows fine-grained control of generated text while outperforming state-of-the-art on the task of toxicity reduction.
Di Jin, Shikib Mehri, Devamanyu Hazarika, Aishwarya Padmakumar, Sungjin Lee, Yang Liu, Mahdi Namazifar
Abstract: learning from human feedback is a prominent technique to align the output of large language models (llms) with human expectations. reinforcement learning from human feedback (rlhf) leverages human preference signals that are in the form of ranking of response pairs to perform this alignment. however, human preference on llm outputs can come in much richer forms including natural language, which may provide detailed feedback on strengths and weaknesses of a given response. in this work we investigate data efficiency of modeling human feedback that is in natural language. specifically, we fine-tune an open-source llm, e.g., falcon-40b-instruct, on a relatively small amount (1000 records or even less) of human feedback in natural language in the form of critiques and revisions of responses. we show that this model is able to improve the quality of responses from even some of the strongest llms such as chatgpt, bard, and vicuna, through critique and revision of those responses. for instance, through one iteration of revision of chatgpt responses, the revised responses have 56.6% win rate over the original ones, and this win rate can be further improved to 65.9% after applying the revision for five iterations.
M. Jorge Cardoso, Julia Moosbauer, Tessa S. Cook, B. Selnur Erdal, Brad Genereaux, Vikash Gupta, Bennett A. Landman, Tiarna Lee, Parashkev Nachev, Elanchezhian Somasundaram, Ronald M. Summers, Khaled Younis, Sebastien Ourselin, Franz Mj Pfister
Abstract: the integration of ai into radiology introduces opportunities for improved clinical care provision and efficiency but it demands a meticulous approach to mitigate potential risks as with any other new technology. beginning with rigorous pre-deployment evaluation and validation, the focus should be on ensuring models meet the highest standards of safety, effectiveness and efficacy for their intended applications. input and output guardrails implemented during production usage act as an additional layer of protection, identifying and addressing individual failures as they occur. continuous post-deployment monitoring allows for tracking population-level performance (data drift), fairness, and value delivery over time. scheduling reviews of post-deployment model performance and educating radiologists about new algorithmic-driven findings is critical for ai to be effective in clinical practice. recognizing that no single ai solution can provide absolute assurance even when limited to its intended use, the synergistic application of quality assurance at multiple levels - regulatory, clinical, technical, and ethical - is emphasized. collaborative efforts between stakeholders spanning healthcare systems, industry, academia, and government are imperative to address the multifaceted challenges involved. trust in ai is an earned privilege, contingent on a broad set of goals, among them transparently demonstrating that the ai adheres to the same rigorous safety, effectiveness and efficacy standards as other established medical technologies. by doing so, developers can instil confidence among providers and patients alike, enabling the responsible scaling of ai and the realization of its potential benefits. the roadmap presented herein aims to expedite the achievement of deployable, reliable, and safe ai in radiology.
Ananya Malik
Abstract: language models have ushered a new age of ai gaining traction within the nlp community as well as amongst the general population. ai's ability to make predictions, generations and its applications in sensitive decision-making scenarios, makes it even more important to study these models for possible biases that may exist and that can be exaggerated. we conduct a quality comparative study and establish a framework to evaluate language models under the premise of two kinds of biases: gender and race, in a professional setting. we find out that while gender bias has reduced immensely in newer models, as compared to older ones, racial bias still exists.
Sonali Singh, Faranak Abri, Akbar Siami Namin
Abstract: with the recent advent of large language models (llms), such as chatgpt from openai, bard from google, llama2 from meta, and claude from anthropic ai, gain widespread use, ensuring their security and robustness is critical. the widespread use of these language models heavily relies on their reliability and proper usage of this fascinating technology. it is crucial to thoroughly test these models to not only ensure its quality but also possible misuses of such models by potential adversaries for illegal activities such as hacking. this paper presents a novel study focusing on exploitation of such large language models against deceptive interactions. more specifically, the paper leverages widespread and borrows well-known techniques in deception theory to investigate whether these models are susceptible to deceitful interactions. this research aims not only to highlight these risks but also to pave the way for robust countermeasures that enhance the security and integrity of language models in the face of sophisticated social engineering tactics. through systematic experiments and analysis, we assess their performance in these critical security domains. our results demonstrate a significant finding in that these large language models are susceptible to deception and social engineering attacks.

2023-11-23

Neo Christopher Chung, George Dyer, Lennart Brocki
Abstract: the global mental health crisis is looming with a rapid increase in mental disorders, limited resources, and the social stigma of seeking treatment. as the field of artificial intelligence (ai) has witnessed significant advancements in recent years, large language models (llms) capable of understanding and generating human-like text may be used in supporting or providing psychological counseling. however, the application of llms in the mental health domain raises concerns regarding the accuracy, effectiveness, and reliability of the information provided. this paper investigates the major challenges associated with the development of llms for psychological counseling, including model hallucination, interpretability, bias, privacy, and clinical effectiveness. we explore potential solutions to these challenges that are practical and applicable to the current paradigm of ai. from our experience in developing and deploying llms for mental health, ai holds a great promise for improving mental health care, if we can carefully navigate and overcome pitfalls of llms.
Muneeswaran I, Shreya Saxena, Siva Prasad, M V Sai Prakash, Advaith Shankar, Varun V, Vishal Vaddina, Saisubramaniam Gopalakrishnan
Abstract: large language models (llms) are widely used in critical fields such as healthcare, education, and finance due to their remarkable proficiency in various language-related tasks. however, llms are prone to generating factually incorrect responses or "hallucinations," which can lead to a loss of credibility and trust among users. to address this issue, we propose a multi-stage framework that generates the rationale first, verifies and refines incorrect ones, and uses them as supporting references to generate the answer. the generated rationale enhances the transparency of the answer and our framework provides insights into how the model arrived at this answer, by using this rationale and the references to the context. in this paper, we demonstrate its effectiveness in improving the quality of responses to drug-related inquiries in the life sciences industry. our framework improves traditional retrieval augmented generation (rag) by enabling openai gpt-3.5-turbo to be 14-25% more faithful and 16-22% more accurate on two datasets. furthermore, fine-tuning samples based on our framework improves the accuracy of smaller open-access llms by 33-42% and competes with rag on commercial models.
Bingkang Shi, Xiaodan Zhang, Dehan Kong, Yulei Wu, Zongzhen Liu, Honglei Lyu, Longtao Huang
Abstract: the social biases and unwelcome stereotypes revealed by pretrained language models are becoming obstacles to their application. compared to numerous debiasing methods targeting word level, there has been relatively less attention on biases present at phrase level, limiting the performance of debiasing in discipline domains. in this paper, we propose an automatic multi-token debiasing pipeline called \textbf{general phrase debiaser}, which is capable of mitigating phrase-level biases in masked language models. specifically, our method consists of a \textit{phrase filter stage} that generates stereotypical phrases from wikipedia pages as well as a \textit{model debias stage} that can debias models at the multi-token level to tackle bias challenges on phrases. the latter searches for prompts that trigger model's bias, and then uses them for debiasing. state-of-the-art results on standard datasets and metrics show that our approach can significantly reduce gender biases on both career and multiple disciplines, across models with varying parameter sizes.
Yan Tao, Olga Viberg, Ryan S. Baker, Rene F. Kizilcec
Abstract: culture fundamentally shapes people's reasoning, behavior, and communication. generative artificial intelligence (ai) technologies may cause a shift towards a dominant culture. as people increasingly use ai to expedite and even automate various professional and personal tasks, cultural values embedded in ai models may bias authentic expression. we audit large language models for cultural bias, comparing their responses to nationally representative survey data, and evaluate country-specific prompting as a mitigation strategy. we find that gpt-4, 3.5 and 3 exhibit cultural values resembling english-speaking and protestant european countries. our mitigation strategy reduces cultural bias in recent models but not for all countries/territories. to avoid cultural bias in generative ai, especially in high-stakes contexts, we suggest using culture matching and ongoing cultural audits.
Jonah Brown-Cohen, Geoffrey Irving, Georgios Piliouras
Abstract: the emergence of pre-trained ai systems with powerful capabilities across a diverse and ever-increasing set of complex domains has raised a critical challenge for ai safety as tasks can become too complicated for humans to judge directly. irving et al. [2018] proposed a debate method in this direction with the goal of pitting the power of such ai models against each other until the problem of identifying (mis)-alignment is broken down into a manageable subtask. while the promise of this approach is clear, the original framework was based on the assumption that the honest strategy is able to simulate deterministic ai systems for an exponential number of steps, limiting its applicability. in this paper, we show how to address these challenges by designing a new set of debate protocols where the honest strategy can always succeed using a simulation of a polynomial number of steps, whilst being able to verify the alignment of stochastic ai systems, even when the dishonest strategy is allowed to use exponentially many simulation steps.
Wu Zekun, Sahan Bulathwela, Adriano Soares Koshiyama
Abstract: large language models (llm) have made significant advances in the recent past becoming more mainstream in artificial intelligence (ai) enabled human-facing applications. however, llms often generate stereotypical output inherited from historical data, amplifying societal biases and raising ethical concerns. this work introduces i) the multi-grain stereotype dataset, which includes 52,751 instances of gender, race, profession and religion stereotypic text and ii) a novel stereotype classifier for english text. we design several experiments to rigorously test the proposed model trained on the novel dataset. our experiments show that training the model in a multi-class setting can outperform the one-vs-all binary counterpart. consistent feature importance signals from different explainable ai tools demonstrate that the new model exploits relevant text features. we utilise the newly created model to assess the stereotypic behaviour of the popular gpt family of models and observe the reduction of bias over time. in summary, our work establishes a robust and practical framework for auditing and evaluating the stereotypic bias in llm.

2023-11-22

Tianhang Zhang, Lin Qiu, Qipeng Guo, Cheng Deng, Yue Zhang, Zheng Zhang, Chenghu Zhou, Xinbing Wang, Luoyi Fu
Abstract: large language models (llms) have gained significant popularity for their impressive performance across diverse fields. however, llms are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. existing works for detecting hallucinations in llms either rely on external knowledge for reference retrieval or require sampling multiple responses from the llm for consistency verification, making these methods costly and inefficient. in this paper, we propose a novel reference-free, uncertainty-based method for detecting hallucinations in llms. our approach imitates human focus in factuality checking from three aspects: 1) focus on the most informative and important keywords in the given text; 2) focus on the unreliable tokens in historical context which may lead to a cascade of hallucinations; and 3) focus on the token properties such as token type and token frequency. experimental results on relevant datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance across all the evaluation metrics and eliminates the need for additional information.
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li
Abstract: using reinforcement learning with human feedback (rlhf) has shown significant promise in fine-tuning diffusion models. previous methods start by training a reward model that aligns with human preferences, then leverage rl techniques to fine-tune the underlying models. however, crafting an efficient reward model demands extensive datasets, optimal architecture, and manual hyperparameter tuning, making the process both time and cost-intensive. the direct preference optimization (dpo) method, effective in fine-tuning large language models, eliminates the necessity for a reward model. however, the extensive gpu memory requirement of the diffusion model's denoising process hinders the direct application of the dpo method. to address this issue, we introduce the direct preference for denoising diffusion policy optimization (d3po) method to directly fine-tune diffusion models. the theoretical analysis demonstrates that although d3po omits training a reward model, it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. this approach requires no training of a reward model, proving to be more direct, cost-effective, and minimizing computational overhead. in experiments, our method uses the relative scale of objectives as a proxy for human preference, delivering comparable results to methods using ground-truth rewards. moreover, d3po demonstrates the ability to reduce image distortion rates and generate safer images, overcoming challenges lacking robust reward models.
Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, Zhendong Mao
Abstract: as large language models attract increasing attention and find widespread application, concurrent challenges of reliability also arise at the same time. confidence calibration, an effective analysis method for gauging the reliability of deep models, serves as a crucial tool for assessing and improving their reliability. however, such investigation has been comparatively underexplored. in this work, we conduct a systematic examination of the calibration of aligned language models throughout the entire construction process, including pretraining and alignment training. at each stage, we investigate how different training settings, such as parameter scales and training data, affect model calibration. to thoroughly assess model calibration, we evaluate models on three most concerned aspects: generation, factuality and understanding. our work sheds light on whether popular llms are well-calibrated and how the training process influences model calibration.
Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
Abstract: incorporating factual knowledge in knowledge graph is regarded as a promising approach for mitigating the hallucination of large language models (llms). existing methods usually only use the user's input to query the knowledge graph, thus failing to address the factual hallucination generated by llms during its reasoning process. to address this problem, this paper proposes knowledge graph-based retrofitting (kgr), a new framework that incorporates llms with kgs to mitigate factual hallucination during the reasoning process by retrofitting the initial draft responses of llms based on the factual knowledge stored in kgs. specifically, kgr leverages llms to extract, select, validate, and retrofit factual statements within the model-generated responses, which enables an autonomous knowledge verifying and refining procedure without any additional manual efforts. experiments show that kgr can significantly improve the performance of llms on factual qa benchmarks especially when involving complex reasoning processes, which demonstrates the necessity and effectiveness of kgr in mitigating hallucination and enhancing the reliability of llms.
Jiaqi Ruan, Gaoqi Liang, Huan Zhao, Guolong Liu, Jing Qiu, Junhua Zhao, Zhao Xu, Fushuan Wen, Zhao Yang Dong
Abstract: applying large language models (llms) to power systems presents a promising avenue for enhancing decision-making and operational efficiency. however, this action may also incur potential security threats, which have not been fully recognized so far. to this end, this letter analyzes potential threats incurred by applying llms to power systems, emphasizing the need for urgent research and development of countermeasures.
Chi Zhang, Zifan Wang, Ravi Mangal, Matt Fredrikson, Limin Jia, Corina Pasareanu
Abstract: modern large language models (llms), such as chatgpt, have demonstrated impressive capabilities for coding tasks including writing and reasoning about code. they improve upon previous neural network models of code, such as code2seq or seq2seq, that already demonstrated competitive results when performing tasks such as code summarization and identifying code vulnerabilities. however, these previous code models were shown vulnerable to adversarial examples, i.e. small syntactic perturbations that do not change the program's semantics, such as the inclusion of "dead code" through false conditions or the addition of inconsequential print statements, designed to "fool" the models. llms can also be vulnerable to the same adversarial perturbations but a detailed study on this concern has been lacking so far. in this paper we aim to investigate the effect of adversarial perturbations on coding tasks with llms. in particular, we study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to llms. furthermore, to make the llms more robust against such adversaries without incurring the cost of retraining, we propose prompt-based defenses that involve modifying the prompt to include additional information such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. our experiments show that adversarial examples obtained with a smaller code model are indeed transferable, weakening the llms' performance. the proposed defenses show promise in improving the model's resilience, paving the way to more robust defensive solutions for llms in code-related applications.
Thomas P. Zollo, Todd Morrill, Zhun Deng, Jake C. Snell, Toniann Pitassi, Richard Zemel
Abstract: the recent explosion in the capabilities of large language models has led to a wave of interest in how best to prompt a model to perform a given task. while it may be tempting to simply choose a prompt based on average performance on a validation set, this can lead to a deployment where unexpectedly poor responses are generated, especially for the worst-off users. to mitigate this prospect, we propose prompt risk control, a lightweight framework for selecting a prompt based on rigorous upper bounds on families of informative risk measures. we offer methods for producing bounds on a diverse set of metrics, including quantities that measure worst-case responses and disparities in generation quality across the population of users. in addition, we extend the underlying statistical bounding techniques to accommodate the possibility of distribution shifts in deployment. experiments on applications such as open-ended chat, medical question summarization, and code generation highlight how such a framework can foster responsible deployment by reducing the risk of the worst outcomes.
Gopichandh Golla
Abstract: these days, deep learning models have achieved great success in multiple fields, from autonomous driving to medical diagnosis. these models have expanded the abilities of artificial intelligence by offering great solutions to complex problems that were very difficult to solve earlier. in spite of their unseen success in various, it has been identified, through research conducted, that deep learning models can be subjected to various attacks that compromise model security and data privacy of the deep neural network models. deep learning models can be subjected to various attacks at different stages of their lifecycle. during the testing phase, attackers can exploit vulnerabilities through different kinds of attacks such as model extraction attacks, model inversion attacks, and adversarial attacks. model extraction attacks are aimed at reverse-engineering a trained deep learning model, with the primary objective of revealing its architecture and parameters. model inversion attacks aim to compromise the privacy of the data used in the deep learning model. these attacks are done to compromise the confidentiality of the model by going through the sensitive training data from the model's predictions. by analyzing the model's responses, attackers aim to reconstruct sensitive information. in this way, the model's data privacy is compromised. adversarial attacks, mainly employed on computer vision models, are made to corrupt models into confidently making incorrect predictions through malicious testing data. these attacks subtly alter the input data, making it look normal but misleading deep learning models to make incorrect decisions. such attacks can happen during both the model's evaluation and training phases. data poisoning attacks add harmful data to the training set, disrupting the learning process and reducing the reliability of the deep learning mode.

2023-11-21

Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, Chao Zhang
Abstract: as software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. given the significant successes of large language models (llms) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. however, a quantitative understanding of their potential in vulnerability detection is still missing. to bridge this gap, we introduce a comprehensive vulnerability benchmark vulbench. this benchmark aggregates high-quality data from a wide range of ctf (capture-the-flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. through our experiments encompassing 16 llms and 6 state-of-the-art (sota) deep learning-based models and static analyzers, we find that several llms outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in llms. this work contributes to the understanding and utilization of llms for enhanced software security.
Alessandro Castelnovo, Nicole Inverardi, Gabriele Nanino, Ilaria Giuseppina Penco, Daniele Regoli
Abstract: in the recent years, the raise in the usage and efficiency of artificial intelligence and, more in general, of automated decision-making systems has brought with it an increasing and welcome awareness of the risks associated with such systems. one of such risks is that of perpetuating or even amplifying bias and unjust disparities present in the data from which many of these systems learn to adjust and optimise their decisions. this awareness has on one side encouraged several scientific communities to come up with more and more appropriate ways and methods to assess, quantify, and possibly mitigate such biases and disparities. on the other hand, it has prompted more and more layers of society, including policy makers, to call for ``fair'' algorithms. we believe that while a lot of excellent and multidisciplinary research is currently being conducted, what is still fundamentally missing is the awareness that having ``fair'' algorithms is per s\'e a nearly meaningless requirement, that needs to be complemented with a lot of additional societal choices to become actionable. namely, there is a hiatus between what the society is demanding from automated decision-making systems, and what this demand actually means in real-world scenarios. in this work, we outline the key features of such a hiatus, and pinpoint a list of fundamental ambiguities and attention points that we as a society must address in order to give a concrete meaning to the increasing demand of fairness in automated decision-making systems.
Robert Gorwa, Michael Veale
Abstract: the ai development community is increasingly making use of hosting intermediaries such as hugging face provide easy access to user-uploaded models and training data. these model marketplaces lower technical deployment barriers for hundreds of thousands of users, yet can be used in numerous potentially harmful and illegal ways. in this article, we explain ways in which ai systems, which can both `contain' content and be open-ended tools, present one of the trickiest platform governance challenges seen to date. we provide case studies of several incidents across three illustrative platforms -- hugging face, github and civitai -- to examine how model marketplaces moderate models. building on this analysis, we outline important (and yet nevertheless limited) practices that industry has been developing to respond to moderation demands: licensing, access and use restrictions, automated content moderation, and open policy development. while the policy challenge at hand is a considerable one, we conclude with some ideas as to how platforms could better mobilize resources to act as a careful, fair, and proportionate regulatory access point.
Mengyang Chen, Lingwei Wei, Han Cao, Wei Zhou, Songlin Hu
Abstract: large language models (llms) have garnered significant attention for their powerful ability in natural language understanding and reasoning. in this paper, we present a comprehensive empirical study to explore the performance of llms on misinformation detection tasks. this study stands as the pioneering investigation into the understanding capabilities of multiple llms regarding both content and propagation across social media platforms. our empirical studies on five misinformation detection datasets show that llms with diverse prompts achieve comparable performance in text-based misinformation detection but exhibit notably constrained capabilities in comprehending propagation structure compared to existing models in propagation-based misinformation detection. besides, we further design four instruction-tuned strategies to enhance llms for both content and propagation-based misinformation detection. these strategies boost llms to actively learn effective features from multiple instances or hard instances, and eliminate irrelevant propagation structures, thereby achieving better detection performance. extensive experiments further demonstrate llms would play a better capacity in content and propagation structure under these proposed strategies and achieve promising detection performance. these findings highlight the potential ability of llms to detect misinformation.
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, David Scott Krueger
Abstract: fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? we address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. we perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capability after only a few gradient steps. this indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. we additionally perform analysis on language models trained on the tinystories dataset to support our claims in a more realistic setup.
Yifan Yang, Yixian Zhang, Daoyang Li, Shuju Sun, Junhong Duan, Junzhou He, Qingyang Wu, Hao Liu
Abstract: geographic privacy, a crucial aspect of personal security, often goes unnoticed in daily activities. this paper addresses the underestimation of this privacy in the context of increasing online data sharing and the advancements in information gathering technologies. with the surge in the use of large multimodal models, such as gpt-4, for open source intelligence (osint), the potential risks associated with geographic privacy breaches have intensified. this study highlights the criticality of these developments, focusing on their implications for individual privacy. the primary objective is to demonstrate the capabilities of advanced ai tools, specifically a gpt-4 based model named "dr. watson," in identifying and potentially compromising geographic privacy through online shared content. we developed "dr. watson" to analyze and extract geographic information from publicly available data sources. the study involved five experimental cases, each offering different perspectives on the tool's application in extracting precise location data from partial images and social media content. the experiments revealed that "dr. watson" could successfully identify specific geographic details, thereby exposing the vulnerabilities in current geo-privacy measures. these findings underscore the ease with which geographic information can be unintentionally disclosed. the paper concludes with a discussion on the broader implications of these findings for individuals and the community at large. it emphasizes the urgency for enhanced awareness and protective measures against geo-privacy leakage in the era of advanced ai and widespread social media usage.
Alejandro Rodriguez Perez, Pablo Rivas
Abstract: this project tackles the pressing issue of human trafficking in online c2c marketplaces through advanced natural language processing (nlp) techniques. we introduce a novel methodology for generating pseudo-labeled datasets with minimal supervision, serving as a rich resource for training state-of-the-art nlp models. focusing on tasks like human trafficking risk prediction (htrp) and organized activity detection (oad), we employ cutting-edge transformer models for analysis. a key contribution is the implementation of an interpretability framework using integrated gradients, providing explainable insights crucial for law enforcement. this work not only fills a critical gap in the literature but also offers a scalable, machine learning-driven approach to combat human exploitation online. it serves as a foundation for future research and practical applications, emphasizing the role of machine learning in addressing complex social issues.
Qinghua Lu, Liming Zhu, Xiwei Xu, Zhenchang Xing, Stefan Harrer, Jon Whittle
Abstract: large language models (llms) have been widely recognized as transformative technology due to their capabilities to understand and generate natural language text, including plans with some limited reasoning capabilities. llm-based agents derive their autonomy from the capabilities of llms, which enable them to autonomously break down the given goal into a set of manageable tasks and orchestrate the task execution to fulfill the goal. despite the huge efforts put into building llm-based autonomous agents, the architecture design of the agents has not yet been systematically explored. also, while there are significant benefits of using autonomous agents for planning and execution, there are serious considerations regarding responsible ai related software quality attributes, such as security and accountability. therefore, this paper presents a pattern-oriented reference architecture that serves as architecture design guidelines and enables responsible-ai-by-design when designing llm-based autonomous agents. we evaluate the completeness and utility of the proposed reference architecture by mapping it to the architecture of two real-world agents.
Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, Yueting Zhuang
Abstract: multi-modal large language models (mllms) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. however, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in mllms, remain under-explored. this work aims to investigate various hallucinations (i.e., object, relation, attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, hallucidoctor, based on the cross-checking paradigm. we use our framework to identify and eliminate hallucinations in the training data automatically. interestingly, hallucidoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. based on that, we execute counterfactual visual instruction expansion to balance data distribution, thereby enhancing mllms' resistance to hallucinations. comprehensive experiments on hallucination evaluation benchmarks show that our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to llava.the source code will be released at \url{https://github.com/yuqifan1117/hallucidoctor}.
Ben Pikus, Will Levine, Tony Chen, Sean Hendryx
Abstract: foundation models, specifically large language models (llm's), have lately gained wide-spread attention and adoption. reinforcement learning with human feedback (rlhf) involves training a reward model to capture desired behaviors, which is then used to align an llm. these reward models are additionally used at inference-time to estimate how well llm responses adhere to those desired behaviors. however, there is little work measuring how robust these reward models are to distribution shifts. in this work, we evaluate how reward model performance - measured via accuracy and calibration (i.e. alignment between accuracy and confidence) - is affected by distribution shift. we show novel calibration patterns and accuracy drops due to ood prompts and responses, and that the reward model is more sensitive to shifts in responses than prompts. additionally, we adapt an ood detection technique commonly used in classification to the reward model setting in order to detect these distribution shifts in prompts and responses.

2023-11-20

Thomas Rüdel, Jochen L. Leidner
Abstract: customer data typically is held in database systems, which can be seen as rule-based knowledge base, whereas businesses increasingly want to benefit from the capabilities of large, pre-trained language models. in this technical report, we describe a case study of how a commercial rule engine and an integrated neural chatbot may be integrated, and what level of control that particular integration mode leads to. we also discuss alternative ways (including past ways realized in other systems) how researchers strive to maintain control and avoid what has recently been called model "hallucination".
Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, Hang Su
Abstract: the rapid advancements in large language models (llms) have led to a resurgence in llm-based agents, which demonstrate impressive human-like behaviors and cooperative capabilities in various interactions and strategy formulations. however, evaluating the safety of llm-based agents remains a complex challenge. this paper elaborately conducts a series of manual jailbreak prompts along with a virtual chat-powered evil plan development team, dubbed evil geniuses, to thoroughly probe the safety aspects of these agents. our investigation reveals three notable phenomena: 1) llm-based agents exhibit reduced robustness against malicious attacks. 2) the attacked agents could provide more nuanced responses. 3) the detection of the produced improper responses is more challenging. these insights prompt us to question the effectiveness of llm-based attacks on agents, highlighting vulnerabilities at various levels and within different role specializations within the system/agent of llm-based agents. extensive evaluation and discussion reveal that llm-based agents face significant challenges in safety and yield insights for future research. our code is available at https://github.com/t1ans1r/evil-geniuses.
Panagiotis Liampas
Abstract: designing a perfect reward function that depicts all the aspects of the intended behavior is almost impossible, especially generalizing it outside of the training environments. active inverse reward design (aird) proposed the use of a series of queries, comparing possible reward functions in a single training environment. this allows the human to give information to the agent about suboptimal behaviors, in order to compute a probability distribution over the intended reward function. however, it ignores the possibility of unknown features appearing in real-world environments, and the safety measures needed until the agent completely learns the reward function. i improved this method and created risk-averse batch active inverse reward design (rbaird), which constructs batches, sets of environments the agent encounters when being used in the real world, processes them sequentially, and, for a predetermined number of iterations, asks queries that the human needs to answer for each environment of the batch. after this process is completed in one batch, the probabilities have been improved and are transferred to the next batch. this makes it capable of adapting to real-world scenarios and learning how to treat unknown features it encounters for the first time. i also integrated a risk-averse planner, similar to that of inverse reward design (ird), which samples a set of reward functions from the probability distribution and computes a trajectory that takes the most certain rewards possible. this ensures safety while the agent is still learning the reward function, and enables the use of this approach in situations where cautiousness is vital. rbaird outperformed the previous approaches in terms of efficiency, accuracy, and action certainty, demonstrated quick adaptability to new, unknown features, and can be more widely used for the alignment of crucial, powerful ai models.
Theodora Worledge, Judy Hanwen Shen, Nicole Meister, Caleb Winston, Carlos Guestrin
Abstract: as businesses, products, and services spring up around large language models, the trustworthiness of these models hinges on the verifiability of their outputs. however, methods for explaining language model outputs largely fall across two distinct fields of study which both use the term "attribution" to refer to entirely separate techniques: citation generation and training data attribution. in many modern applications, such as legal document generation and medical question answering, both types of attributions are important. in this work, we argue for and present a unified framework of large language model attributions. we show how existing methods of different types of attribution fall under the unified framework. we also use the framework to discuss real-world use cases where one or both types of attributions are required. we believe that this unified framework will guide the use case driven development of systems that leverage both types of attribution, as well as the standardization of their evaluation.

2023-11-19

Rahul Madhavan, Kahini Wadhawan
Abstract: we study attribute control in language models through the method of causal average treatment effect (causal ate). existing methods for the attribute control task in language models (lms) check for the co-occurrence of words in a sentence with the attribute of interest, and control for them. however, spurious correlation of the words with the attribute in the training dataset, can cause models to hallucinate the presence of the attribute when presented with the spurious correlate during inference. we show that the simple perturbation-based method of causal ate removes this unintended effect. additionally, we offer a theoretical foundation for investigating causal ate in the classification task, and prove that it reduces the number of false positives -- thereby mitigating the issue of unintended bias. specifically, we ground it in the problem of toxicity mitigation, where a significant challenge lies in the inadvertent bias that often emerges towards protected groups post detoxification. we show that this unintended bias can be solved by the use of the causal ate metric.
Jiaming Zhang, Xingjun Ma, Xin Wang, Lingyu Qiu, Jiaqi Wang, Yu-Gang Jiang, Jitao Sang
Abstract: with the rapid advancement of multimodal learning, pre-trained vision-language models (vlms) such as clip have demonstrated remarkable capacities in bridging the gap between visual and language modalities. however, these models remain vulnerable to adversarial attacks, particularly in the image modality, presenting considerable security risks. this paper introduces adversarial prompt tuning (advpt), a novel technique to enhance the adversarial robustness of image encoders in vlms. advpt innovatively leverages learnable text prompts and aligns them with adversarial image embeddings, to address the vulnerabilities inherent in vlms without the need for extensive parameter training or modification of the model architecture. we demonstrate that advpt improves resistance against white-box and black-box adversarial attacks and exhibits a synergistic effect when combined with existing image-processing-based defense techniques, further boosting defensive capabilities. comprehensive experimental analyses provide insights into adversarial prompt tuning, a novel paradigm devoted to improving resistance to adversarial images through textual input modifications, paving the way for future robust multimodal learning research. these findings open up new possibilities for enhancing the security of vlms. our code will be available upon publication of the paper.
Shaoxiong Ji, Tianlin Zhang, Kailai Yang, Sophia Ananiadou, Erik Cambria
Abstract: large language models (llms) have become valuable assets in mental health, showing promise in both classification tasks and counseling applications. this paper offers a perspective on using llms in mental health applications. it discusses the instability of generative models for prediction and the potential for generating hallucinatory outputs, underscoring the need for ongoing audits and evaluations to maintain their reliability and dependability. the paper also distinguishes between the often interchangeable terms ``explainability'' and ``interpretability'', advocating for developing inherently interpretable methods instead of relying on potentially hallucinated self-explanations generated by llms. despite the advancements in llms, human counselors' empathetic understanding, nuanced interpretation, and contextual awareness remain irreplaceable in the sensitive and complex realm of mental health counseling. the use of llms should be approached with a judicious and considerate mindset, viewing them as tools that complement human expertise rather than seeking to replace it.
Erik Derner, Kristina Batistič, Jan Zahálka, Robert Babuška
Abstract: as large language models (llms) permeate more and more applications, an assessment of their associated security risks becomes increasingly necessary. the potential for exploitation by malicious actors, ranging from disinformation to data breaches and reputation damage, is substantial. this paper addresses a gap in current research by focusing on the security risks posed by llms, which extends beyond the widely covered ethical and societal implications. our work proposes a taxonomy of security risks along the user-model communication pipeline, explicitly focusing on prompt-based attacks on llms. we categorize the attacks by target and attack type within a prompt-based interaction scheme. the taxonomy is reinforced with specific attack examples to showcase the real-world impact of these risks. through this taxonomy, we aim to inform the development of robust and secure llm applications, enhancing their safety and trustworthiness.
Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Vishy Swaminathan
Abstract: in recent years, large language models (llm) have emerged as pivotal tools in various applications. however, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that lead to undesirable outputs. the inherent vulnerability of llms stems from their input-output mechanisms, especially when presented with intensely out-of-distribution (ood) inputs. this paper proposes a token-level detection method to identify adversarial prompts, leveraging the llm's capability to predict the next token's probability. we measure the degree of the model's perplexity and incorporate neighboring token information to encourage the detection of contiguous adversarial prompt sequences. as a result, we propose two methods: one that identifies each token as either being part of an adversarial prompt or not, and another that estimates the probability of each token being part of an adversarial prompt.
Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Xinyu Xing
Abstract: in the rapidly evolving landscape of artificial intelligence, chatgpt has been widely used in various applications. the new feature: customization of chatgpt models by users to cater to specific needs has opened new frontiers in ai utility. however, this study reveals a significant security vulnerability inherent in these user-customized gpts: prompt injection attacks. through comprehensive testing of over 200 user-designed gpt models via adversarial prompts, we demonstrate that these systems are susceptible to prompt injections. through prompt injection, an adversary can not only extract the customized system prompts but also access the uploaded files. this paper provides a first-hand analysis of the prompt injection, alongside the evaluation of the possible mitigation of such attacks. our findings underscore the urgent need for robust security frameworks in the design and deployment of customizable gpt models. the intent of this paper is to raise awareness and prompt action in the ai community, ensuring that the benefits of gpt customization do not come at the cost of compromised security and privacy.

2023-11-18

Wanqin Ma, Chenyang Yang, Christian Kästner
Abstract: large language models (llms) are increasingly integrated into software applications. downstream application developers often access llms through apis provided as a service. however, llm apis are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. this can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving llm apis. we argue that regression testing llms requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in llm apis.
Saizhuo Wang, Zhihan Liu, Zhaoran Wang, Jian Guo
Abstract: large language models (llms) are versatile, yet they often falter in tasks requiring deep and reliable reasoning due to issues like hallucinations, limiting their applicability in critical scenarios. this paper introduces a rigorously designed framework for creating llms that effectively anchor knowledge and employ a closed-loop reasoning process, enhancing their capability for in-depth analysis. we dissect the framework to illustrate the contribution of each component to the llms' performance, offering a theoretical assurance of improved reasoning under well-defined assumptions.
Zhaowei Zhu, Jialu Wang, Hao Cheng, Yang Liu
Abstract: language models have shown promise in various tasks but can be affected by undesired data during training, fine-tuning, or alignment. for example, if some unsafe conversations are wrongly annotated as safe ones, the model fine-tuned on these samples may be harmful. therefore, the correctness of annotations, i.e., the credibility of the dataset, is important. this study focuses on the credibility of real-world datasets, including the popular benchmarks jigsaw civil comments, anthropic harmless & red team, pku beavertails & saferlhf, that can be used for training a harmless language model. given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets, identifying label errors, and evaluating the influence of noisy labels in the curated language data, specifically focusing on unsafe comments and conversation classification. with the framework, we find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks. the data credibility and downstream learning performance can be remarkably improved by directly fixing label errors, indicating the significance of cleaning existing real-world datasets. open-source: https://github.com/docta-ai/docta.

2023-11-17

Yi Yang, Hanyu Duan, Ahmed Abbasi, John P. Lalor, Kar Yan Tam
Abstract: transformer-based pretrained large language models (plm) such as bert and gpt have achieved remarkable success in nlp tasks. however, plms are prone to encoding stereotypical biases. although a burgeoning literature has emerged on stereotypical bias mitigation in plms, such as work on debiasing gender and racial stereotyping, how such biases manifest and behave internally within plms remains largely unknown. understanding the internal stereotyping mechanisms may allow better assessment of model fairness and guide the development of effective mitigation strategies. in this work, we focus on attention heads, a major component of the transformer architecture, and propose a bias analysis framework to explore and identify a small set of biased heads that are found to contribute to a plm's stereotypical bias. we conduct extensive experiments to validate the existence of these biased heads and to better understand how they behave. we investigate gender and racial bias in the english language in two types of transformer-based plms: the encoder-based bert model and the decoder-based autoregressive gpt model. overall, the results shed light on understanding the bias behavior in pretrained language models.
Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau
Abstract: a prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. yet real-world autonomous tests face several unique safety challenges, both due to the possibility of causing harm during a test, as well as the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. we propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. we a design a basic safety monitor that is flexible enough to monitor existing llm agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. then we apply the safety monitor on a battery of real-world tests of autogpt, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.
Daniel Russo, Shane Peter Kaszefski-Yaschuk, Jacopo Staiano, Marco Guerini
Abstract: the proliferation of misinformation on social media platforms (smps) poses a significant danger to public health, social cohesion and ultimately democracy. previous research has shown how social correction can be an effective way to curb misinformation, by engaging directly in a constructive dialogue with users who spread -- often in good faith -- misleading messages. although professional fact-checkers are crucial to debunking viral claims, they usually do not engage in conversations on social media. thereby, significant effort has been made to automate the use of fact-checker material in social correction; however, no previous work has tried to integrate it with the style and pragmatics that are commonly employed in social media communication. to fill this gap, we present vermouth, the first large-scale dataset comprising roughly 12 thousand claim-response pairs (linked to debunking articles), accounting for both smp-style and basic emotions, two factors which have a significant role in misinformation credibility and spreading. to collect this dataset we used a technique based on an author-reviewer pipeline, which efficiently combines llms and human annotators to obtain high-quality data. we also provide comprehensive experiments showing how models trained on our proposed dataset have significant improvements in terms of output quality and generalization capabilities.
David Thorstad
Abstract: traditional discussions of bias in large language models focus on a conception of bias closely tied to unfairness, especially as affecting marginalized groups. recent work raises the novel possibility of assessing the outputs of large language models for a range of cognitive biases familiar from research in judgment and decisionmaking. my aim in this paper is to draw two lessons from recent discussions of cognitive bias in large language models: cautious optimism about the prevalence of bias in current models coupled with an anti-panglossian willingness to concede the existence of some genuine biases and work to reduce them. i draw out philosophical implications of this discussion for the rationality of human cognitive biases as well as the role of unrepresentative data in driving model biases.
K. J. Kevin Feng, Quan Ze, N/A Chen, Inyoung Cheong, King Xia, Amy X. Zhang
Abstract: case studies commonly form the pedagogical backbone in law, ethics, and many other domains that face complex and ambiguous societal questions informed by human values. similar complexities and ambiguities arise when we consider how ai should be aligned in practice: when faced with vast quantities of diverse (and sometimes conflicting) values from different individuals and communities, with whose values is ai to align, and how should ai do so? we propose a complementary approach to constitutional ai alignment, grounded in ideas from case-based reasoning (cbr), that focuses on the construction of policies through judgments on a set of cases. we present a process to assemble such a case repository by: 1) gathering a set of ``seed'' cases -- questions one may ask an ai system -- in a particular domain from discussions in online communities, 2) eliciting domain-specific key dimensions for cases through workshops with domain experts, 3) using llms to generate variations of cases not seen in the wild, and 4) engaging with the public to judge and improve cases. we then discuss how such a case repository could assist in ai alignment, both through directly acting as precedents to ground acceptable behaviors, and as a medium for individuals and communities to engage in moral reasoning around ai

2023-11-16

Minbeom Kim, Jahyun Koo, Hwanhee Lee, Joonsuk Park, Hwaran Lee, Kyomin Jung
Abstract: as large language models become increasingly integrated into daily life, detecting implicit toxicity across diverse contexts is crucial. to this end, we introduce lifetox, a dataset designed for identifying implicit toxicity within a broad range of advice-seeking scenarios. unlike existing safety datasets, lifetox comprises diverse contexts derived from personal experiences through open-ended questions. experiments demonstrate that roberta fine-tuned on lifetox matches or surpasses the zero-shot performance of large language models in toxicity classification tasks. these results underscore the efficacy of lifetox in addressing the complex challenges inherent in implicit toxicity.
Yun-Shiuan Chuang, Agam Goyal, Nikunj Harlalka, Siddharth Suresh, Robert Hawkins, Sijia Yang, Dhavan Shah, Junjie Hu, Timothy T. Rogers
Abstract: accurately simulating human opinion dynamics is crucial for understanding a variety of societal phenomena, including polarization and the spread of misinformation. however, the agent-based models (abms) commonly used for such simulations lack fidelity to human behavior. we propose a new approach to simulating opinion dynamics based on populations of large language models (llms). our findings reveal a strong inherent bias in llm agents towards accurate information, leading to consensus in line with scientific reality. however, this bias limits the simulation of individuals with resistant views on issues like climate change. after inducing confirmation bias through prompt engineering, we observed opinion fragmentation in line with existing agent-based research. these insights highlight the promise and limitations of llm agents in this domain and suggest a path forward: refining llms with real-world discourse to better simulate the evolution of human beliefs.
Nakyeong Yang, Taegwan Kang, Kyomin Jung
Abstract: large language models (llms) executing tasks through instruction-based prompts often face challenges stemming from distribution differences between user instructions and training instructions. this leads to distractions and biases, especially when dealing with inconsistent dynamic labels. in this paper, we introduces a novel bias mitigation method, crispr, designed to alleviate instruction-label biases in llms. crispr utilizes attribution methods to identify bias neurons influencing biased outputs and employs pruning to eliminate the bias neurons. experimental results demonstrate the method's effectiveness in mitigating biases in instruction-based prompting, enhancing language model performance on social bias benchmarks without compromising pre-existing knowledge. crispr proves highly practical, model-agnostic, offering flexibility in adapting to evolving social biases.
Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, Chaowei Xiao
Abstract: reinforcement learning with human feedback (rlhf) is a methodology designed to align large language models (llms) with human preferences, playing an important role in llms alignment. despite its advantages, rlhf relies on human annotators to rank the text, which can introduce potential security vulnerabilities if any adversarial annotator (i.e., attackers) manipulates the ranking score by up-ranking any malicious text to steer the llm adversarially. to assess the red-teaming of rlhf against human preference data poisoning, we propose rankpoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors (e.g., generating longer sequences, which can increase the computational cost). with poisoned dataset generated by rankpoison, we can perform poisoning attacks on llms to generate longer tokens without hurting the original safety alignment performance. moreover, applying rankpoison, we also successfully implement a backdoor attack where llms can generate longer answers under questions with the trigger word. our findings highlight critical security challenges in rlhf, underscoring the necessity for more robust alignment methods for llms.
Yuhang Li, Yihan Wang, Zhouxing Shi, Cho-Jui Hsieh
Abstract: the strong general capabilities of large language models (llms) bring potential ethical risks if they are unrestrictedly accessible to malicious users. token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions with a private random number generator seeded by its prefix tokens. however, this watermarking algorithm alters the logits during generation, which can lead to a downgraded text quality if it chooses to promote tokens that are less relevant given the input. in this work, we propose to improve the quality of texts generated by a watermarked language model by watermarking with importance scoring (wis). at each generation step, we estimate the importance of the token to generate, and prevent it from being impacted by watermarking if it is important for the semantic correctness of the output. we further propose three methods to predict importance scoring, including a perturbation-based method and two model-based methods. empirical experiments show that our method can generate texts with better quality with comparable level of detection rate.
Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, Tong Zhang
Abstract: large language models (llms) have revolutionized numerous domains with their impressive performance but still face their challenges. a predominant issue is the propensity for these models to generate non-existent facts, a concern termed hallucination. our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. when the question is out of the parametric knowledge, it will try to make up something and fail to indicate when it lacks knowledge. in this paper, we present a new approach called refusal-aware instruction tuning (r-tuning). this approach is formalized by first identifying the knowledge gap between parametric knowledge and the instruction tuning data. then, we construct the refusal-aware data based on the knowledge intersection, to tune llms to refrain from responding to questions beyond its parametric knowledge. experimental results demonstrate this new instruction tuning approach effectively improves a model's ability to answer known questions and refrain from answering unknown questions. furthermore, when tested on out-of-domain datasets, the refusal ability was found to be a meta-skill that could be generalized to other tasks. further analysis surprisingly finds that learning the uncertainty during training displays a better ability to estimate uncertainty than uncertainty-based testing. our code will be released at https://github.com/shizhediao/r-tuning.
Ziyan Guo, Jun Liu
Abstract: the rapid progress of large models (lms) has recently revolutionized various fields of deep learning with remarkable grades, ranging from natural language processing (nlp) to computer vision (cv). however, lms are increasingly challenged and criticized by academia and industry due to their powerful performance but untrustworthy behavior, which urgently needs to be alleviated in reliable methods. despite the abundance of literature on trustworthy lms in language, a systematic survey specifically delving into the trustworthiness of lms in vision remains absent. in order to mitigate this gap, we summarize four relevant concerns that obstruct the trustworthy usage in vision of lms in this survey, including 1) human misuse, 2) vulnerability, 3) inherent issue and 4) interpretability. by highlighting corresponding challenge, countermeasures, and discussion in each topic, we hope this survey will facilitate readers' understanding of the field, promote alignment of lms with human expectations and enable trustworthy lms to serve as welfare rather than disaster for human society.
Zihao He, Siyi Guo, Ashwin Rao, Kristina Lerman
Abstract: social media platforms are rife with politically charged discussions. therefore, accurately deciphering and predicting partisan biases using large language models (llms) is increasingly critical. in this study, we address the challenge of understanding political bias in digitized discourse using llms. while traditional approaches often rely on finetuning separate models for each political faction, our work innovates by employing a singular, instruction-tuned llm to reflect a spectrum of political ideologies. we present a comprehensive analytical framework, consisting of partisan bias divergence assessment and partisan class tendency prediction, to evaluate the model's alignment with real-world political ideologies in terms of stances, emotions, and moral foundations. our findings reveal the model's effectiveness in capturing emotional and moral nuances, albeit with some challenges in stance detection, highlighting the intricacies and potential for refinement in nlp tools for politically sensitive contexts. this research contributes significantly to the field by demonstrating the feasibility and importance of nuanced political understanding in llms, particularly for applications requiring acute awareness of political bias.
Ashim Gupta, Rishanth Rajendhran, Nathan Stringham, Vivek Srikumar, Ana Marasović
Abstract: are the longstanding robustness issues in nlp resolved by today's larger and more performant models? to address this question, we conduct a thorough investigation using 19 models of different sizes spanning different architectural choices and pretraining objectives. we conduct evaluations using (a) ood and challenge test sets, (b) checklists, (c) contrast sets, and (d) adversarial inputs. our analysis reveals that not all ood tests provide further insight into robustness. evaluating with checklists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. we conclude that not only is the question of robustness in nlp as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens
Abstract: human perception of language depends on personal backgrounds like gender and ethnicity. while existing studies have shown that large language models (llms) hold values that are closer to certain societal groups, it is unclear whether their prediction behaviors on subjective nlp tasks also exhibit a similar bias. in this study, leveraging the popquorn dataset which contains annotations of diverse demographic backgrounds, we conduct a series of experiments on four popular llms to investigate their capability to understand group differences and potential biases in their predictions for politeness and offensiveness. we find that for both tasks, model predictions are closer to the labels from white and female participants. we further explore prompting with the target demographic labels and show that including the target demographic in the prompt actually worsens the model's performance. more specifically, when being prompted to respond from the perspective of "black" and "asian" individuals, models show lower performance in predicting both overall scores as well as the scores from corresponding groups. our results suggest that llms hold gender and racial biases for subjective nlp tasks and that demographic-infused prompts alone may be insufficient to mitigate such effects. code and data are available at https://github.com/jiaxin-pei/llm-group-bias.
Genglin Liu, Xingyao Wang, Lifan Yuan, Yangyi Chen, Hao Peng
Abstract: large language models (llms) often struggle when faced with situations where they lack the prerequisite knowledge to generate a sensical response. in these cases, models tend to fabricate and hallucinate, rather than appropriately signaling uncertainty as humans would. this behavior misaligns with human conversational norms and presents challenges surrounding responsible and ethical ai development. this work aims to systematically investigate llms' behaviors in such situations. we curate an adversarial question-answering benchmark containing unanswerable questions targeting information absent from the llm's training data. concretely, these unanswerable questions contain non-existent concepts or false premises. when presented with such unanswerable questions, an llm should appropriately convey uncertainty, and be able to challenge the premise and refuse to generate a response. while facing answerable valid questions, a model should demonstrate a positive correlation between accuracy and confidence. using a model-agnostic unified confidence elicitation approach, we observe that llms that have gone through instruction finetuning and reinforcement learning from human feedback (rlhf) perform significantly better than their counterparts that do not. moreover, uncertainty expression 1 through our elicitation method does not always stay consistent with the perceived confidence of the direct response of an llm. our findings call for further research into teaching llms to proactively and reliably express uncertainty.
Wenjie Mo, Jiashu Xu, Qin Liu, Jiongxiao Wang, Jun Yan, Chaowei Xiao, Muhao Chen
Abstract: existing studies in backdoor defense have predominantly focused on the training phase, overlooking the critical aspect of testing time defense. this gap becomes particularly pronounced in the context of large language models (llms) deployed as web services, which typically offer only black-box access, rendering training-time defenses impractical. to bridge this gap, our work introduces defensive demonstrations, an innovative backdoor defense strategy for blackbox large language models. our method involves identifying the task and retrieving task-relevant demonstrations from an uncontaminated pool. these demonstrations are then combined with user queries and presented to the model during testing, without requiring any modifications/tuning to the black-box model or insights into its internal mechanisms. defensive demonstrations are designed to counteract the adverse effects of triggers, aiming to recalibrate and correct the behavior of poisoned models during test-time evaluations. extensive experiments show that defensive demonstrations are effective in defending both instance-level and instruction-level backdoor attacks, not only rectifying the behavior of poisoned models but also surpassing existing baselines in most scenarios.
Evgeniia Razumovskaia, Ivan Vulić, Pavle Marković, Tomasz Cichy, Qian Zheng, Tsung-Hsien Wen, Paweł Budzianowski
Abstract: factuality is a crucial requirement in information seeking dialogue: the system should respond to the user's queries so that the responses are meaningful and aligned with the knowledge provided to the system. however, most modern large language models suffer from hallucinations, that is, they generate responses not supported by or contradicting the knowledge source. to mitigate the issue and increase faithfulness of information-seeking dialogue systems, we introduce beinfo, a simple yet effective method that applies behavioural tuning to aid information-seeking dialogue. relying on three standard datasets, we show that models tuned with beinfo} become considerably more faithful to the knowledge source both for datasets and domains seen during beinfo-tuning, as well as on unseen domains, when applied in a zero-shot manner. in addition, we show that the models with 3b parameters (e.g., flan-t5) tuned with beinfo demonstrate strong performance on data from real `production' conversations and outperform gpt4 when tuned on a limited amount of such realistic in-domain dialogues.
Nan Xu, Fei Wang, Ben Zhou, Bang Zheng Li, Chaowei Xiao, Muhao Chen
Abstract: while large language models (llms) have demonstrated increasing power, they have also given rise to a wide range of harmful behaviors. as representatives, jailbreak attacks can provoke harmful or unethical responses from llms, even after safety alignment. in this paper, we investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of llms. specifically, we analyze the safety vulnerability of llms in the face of (1) multilingual cognitive overload, (2) veiled expression, and (3) effect-to-cause reasoning. different from previous jailbreak attacks, our proposed cognitive overload is a black-box attack with no need for knowledge of model architecture or access to model weights. experiments conducted on advbench and masterkey reveal that various llms, including both popular open-source model llama 2 and the proprietary model chatgpt, can be compromised through cognitive overload. motivated by cognitive psychology work on managing cognitive load, we further investigate defending cognitive overload attack from two perspectives. empirical studies show that our cognitive overload from three perspectives can jailbreak all studied llms successfully, while existing defense strategies can hardly mitigate the caused malicious uses effectively.
Yao Qiang, Xiangyu Zhou, Dongxiao Zhu
Abstract: in-context learning (icl) has emerged as a powerful paradigm leveraging llms for specific tasks by utilizing labeled examples as demonstrations in the precondition prompts. despite its promising performance, icl suffers from instability with the choice and arrangement of examples. additionally, crafted adversarial attacks pose a notable threat to the robustness of icl. however, existing attacks are either easy to detect, rely on external models, or lack specificity towards icl. to address these issues, this work introduces a novel transferable attack for icl, aiming to hijack llms to generate the targeted response. the proposed llm hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demonstrations. extensive experimental results on various tasks and datasets demonstrate the effectiveness of our llm hijacking attack, resulting in a distracted attention towards adversarial tokens, consequently leading to the targeted unwanted outputs.
Sagi Pendzel, Tomer Wullach, Amir Adler, Einat Minkov
Abstract: automatic hate speech detection using deep neural models is hampered by the scarcity of labeled datasets, leading to poor generalization. to mitigate this problem, generative ai has been utilized to generate large amounts of synthetic hate speech sequences from available labeled examples, leveraging the generated data in finetuning large pre-trained language models (llms). in this chapter, we provide a review of relevant methods, experimental setups and evaluation of this approach. in addition to general llms, such as bert, roberta and albert, we apply and evaluate the impact of train set augmentation with generated data using llms that have been already adapted for hate detection, including roberta-toxicity, hatebert, hatexplain, toxdect, and toxigen. an empirical study corroborates our previous findings, showing that this approach improves hate speech generalization, boosting recall performance across data distributions. in addition, we explore and compare the performance of the finetuned llms with zero-shot hate detection using a gpt-3.5 model. our results demonstrate that while better generalization is achieved using the gpt-3.5 model, it achieves mediocre recall and low precision on most datasets. it is an open question whether the sensitivity of models such as gpt-3.5, and onward, can be improved using similar techniques of text generation.
Kathrin Grosse, Lukas Bieringer, Tarek Richard Besold, Alexandre Alahi
Abstract: recent works have identified a gap between research and practice in artificial intelligence security: threats studied in academia do not always reflect the practical use and security risks of ai. for example, while models are often studied in isolation, they form part of larger ml pipelines in practice. recent works also brought forward that adversarial manipulations introduced by academic attacks are impractical. we take a first step towards describing the full extent of this disparity. to this end, we revisit the threat models of the six most studied attacks in ai security research and match them to ai usage in practice via a survey with \textbf{271} industrial practitioners. on the one hand, we find that all existing threat models are indeed applicable. on the other hand, there are significant mismatches: research is often too generous with the attacker, assuming access to information not frequently available in real-world settings. our paper is thus a call for action to study more practical threat models in artificial intelligence security.
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran
Abstract: we present dress, a large vision language model (lvlm) that innovatively exploits natural language feedback (nlf) from large language models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art lvlms. first, prior lvlms generally rely only on the instruction finetuning stage to enhance alignment with human preferences. without incorporating extra feedback, they are still prone to generate unhelpful, hallucinated, or harmful responses. second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. this reduces the capacity for effective multi-turn interactions. to tackle these, we propose a novel categorization of the nlf into two key types: critique and refinement. the critique nlf identifies the strengths and weaknesses of the responses and is used to align the lvlms with human preferences. the refinement nlf offers concrete suggestions for improvement and is adopted to improve the interaction ability of the lvlms-- which focuses on lvlms' ability to refine responses by incorporating feedback in multi-turn interactions. to address the non-differentiable nature of nlf, we generalize conditional reinforcement learning for training. our experimental results demonstrate that dress can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to sota lvmls.
Ivan Flechais, George Chalhoub
Abstract: research into the ethics of cybersecurity is an established and growing topic of investigation, however the translation of this research into practice is lacking: there exists a small number of professional codes of ethics or codes of practice in cybersecurity, however these are very broad and do not offer much insight into the ethical dilemmas that can be faced while performing specific cybersecurity activities. in order to address this gap, we leverage ongoing work on the cyber security body of knowledge (cybok) to help elicit and document the responsibilities and ethics of the profession. based on a literature review of the ethics of cybersecurity, we use cybok to frame the exploration of ethical challenges in the cybersecurity profession through a series of 15 interviews with cybersecurity experts. our approach is qualitative and exploratory, aiming to answer the research question "what ethical challenges, insights, and solutions arise in different areas of cybersecurity?". our findings indicate that there are broad ethical challenges across the whole of cybersecurity, but also that different areas of cybersecurity can face specific ethical considerations for which more detailed guidance can help professionals in those areas. in particular, our findings indicate that security decision-making is expected of all security professionals, but that this requires them to balance a complex mix of technical, objective and subjective points of view, and that resolving conflicts raises challenging ethical dilemmas. we conclude that more work is needed to explore, map, and integrate ethical considerations into cybersecurity practice; the urgent need to conduct further research into the ethics of cybersecurity ai; and highlight the importance of this work for individuals and professional bodies who seek to develop and mature the cybersecurity profession in a responsible manner.
Ambri Ma, Arnav Kumar, Brett Zeligson
Abstract: the training of large language models (llms) on extensive, unfiltered corpora sourced from the internet is a common and advantageous practice. consequently, llms have learned and inadvertently reproduced various types of biases, including violent, offensive, and toxic language. however, recent research shows that generative pretrained transformer (gpt) language models can recognize their own biases and detect toxicity in generated content, a process referred to as self-diagnosis. in response, researchers have developed a decoding algorithm that allows llms to self-debias, or reduce their likelihood of generating harmful text. this study investigates the efficacy of the diagnosing-debiasing approach in mitigating two additional types of biases: insults and political bias. these biases are often used interchangeably in discourse, despite exhibiting potentially dissimilar semantic and syntactic properties. we aim to contribute to the ongoing effort of investigating the ethical and social implications of human-ai interaction.

2023-11-15

Ethan Shaotran, Ido Pesok, Sam Jones, Emi Liu
Abstract: we are introducing aligned, a platform for global governance and alignment of frontier models, and eventually superintelligence. while previous efforts at the major ai labs have attempted to gather inputs for alignment, these are often conducted behind closed doors. we aim to set the foundation for a more trustworthy, public-facing approach to safety: a constitutional committee framework. initial tests with 680 participants result in a 30-guideline constitution with 93% overall support. we show the platform naturally scales, instilling confidence and enjoyment from the community. we invite other ai labs and teams to plug and play into the aligned ecosystem.
Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, Yang Zhang
Abstract: uncertainty decomposition refers to the task of decomposing the total uncertainty of a model into data (aleatoric) uncertainty, resulting from the inherent complexity or ambiguity of the data, and model (epistemic) uncertainty, resulting from the lack of knowledge in the model. performing uncertainty decomposition for large language models (llms) is an important step toward improving the reliability, trustworthiness, and interpretability of llms, but this research task is very challenging and remains unresolved. the existing canonical method, bayesian neural network (bnn), cannot be applied to llms, because bnn requires training and ensembling multiple variants of models, which is infeasible or prohibitively expensive for llms. in this paper, we introduce an uncertainty decomposition framework for llms, called input clarifications ensemble, which bypasses the need to train new models. rather than ensembling models with different parameters, our approach generates a set of clarifications for the input, feeds them into the fixed llms, and ensembles the corresponding predictions. we show that our framework shares a symmetric decomposition structure with bnn. empirical evaluations demonstrate that the proposed framework provides accurate and reliable uncertainty quantification on various tasks. code will be made publicly available at https://github.com/ucsb-nlp-chang/llm_uncertainty .
Minze Chen, Zhenxiang Tao, Weitong Tang, Tingxin Qin, Rui Yang, Chunli Zhu
Abstract: emergency management urgently requires comprehensive knowledge while having a high possibility to go beyond individuals' cognitive scope. therefore, artificial intelligence(ai) supported decision-making under that circumstance is of vital importance. recent emerging large language models (llm) provide a new direction for enhancing targeted machine intelligence. however, the utilization of llm directly would inevitably introduce unreliable output for its inherent issue of hallucination and poor reasoning skills. in this work, we develop a system called enhancing emergency decision-making with knowledge graph and llm (e-kell), which provides evidence-based decision-making in various emergency stages. the study constructs a structured emergency knowledge graph and guides llms to reason over it via a prompt chain. in real-world evaluations, e-kell receives scores of 9.06, 9.09, 9.03, and 9.09 in comprehensibility, accuracy, conciseness, and instructiveness from a group of emergency commanders and firefighters, demonstrating a significant improvement across various situations compared to baseline models. this work introduces a novel approach to providing reliable emergency decision support.
Ivan Vykopal, Matúš Pikuliak, Ivan Srba, Robert Moro, Dominik Macko, Maria Bielikova
Abstract: automated disinformation generation is often listed as one of the risks of large language models (llms). the theoretical ability to flood the information space with disinformation content might have dramatic consequences for democratic societies around the world. this paper presents a comprehensive study of the disinformation capabilities of the current generation of llms to generate false news articles in english language. in our study, we evaluated the capabilities of 10 llms using 20 disinformation narratives. we evaluated several aspects of the llms: how well they are at generating news articles, how strongly they tend to agree or disagree with the disinformation narratives, how often they generate safety warnings, etc. we also evaluated the abilities of detection models to detect these articles as llm-generated. we conclude that llms are able to generate convincing news articles that agree with dangerous disinformation narratives.
Vaishnavi Shrivastava, Percy Liang, Ananya Kumar
Abstract: to maintain user trust, large language models (llms) should signal low confidence on examples where they are incorrect, instead of misleading the user. the standard approach of estimating confidence is to use the softmax probabilities of these models, but as of november 2023, state-of-the-art llms such as gpt-4 and claude-v1.3 do not provide access to these probabilities. we first study eliciting confidence linguistically -- asking an llm for its confidence in its answer -- which performs reasonably (80.5% auc on gpt-4 averaged across 12 question-answering datasets -- 7% above a random baseline) but leaves room for improvement. we then explore using a surrogate confidence model -- using a model where we do have probabilities to evaluate the original model's confidence in a given question. surprisingly, even though these probabilities come from a different and often weaker model, this method leads to higher auc than linguistic confidences on 9 out of 12 datasets. our best method composing linguistic confidences and surrogate model probabilities gives state-of-the-art confidence estimates on all 12 datasets (84.6% average auc on gpt-4).
Kerianne L. Hobbs, Bernard Li
Abstract: designing a safe, trusted, and ethical ai may be practically impossible; however, designing ai with safe, trusted, and ethical use in mind is possible and necessary in safety and mission-critical domains like aerospace. safe, trusted, and ethical use of ai are often used interchangeably; however, a system can be safely used but not trusted or ethical, have a trusted use that is not safe or ethical, and have an ethical use that is not safe or trusted. this manuscript serves as a primer to illuminate the nuanced differences between these concepts, with a specific focus on applications of human-ai teaming in aerospace system control, where humans may be in, on, or out-of-the-loop of decision-making.
Marta Marchiori Manerba, Karolina Stańczak, Riccardo Guidotti, Isabelle Augenstein
Abstract: large language models have been shown to encode a variety of social biases, which carries the risk of downstream harms. while the impact of these biases has been recognized, prior methods for bias evaluation have been limited to binary association tests on small datasets, offering a constrained view of the nature of societal biases within language models. in this paper, we propose an original framework for probing language models for societal biases. we collect a probing dataset to analyze language models' general associations, as well as along the axes of societal categories, identities, and stereotypes. to this end, we leverage a novel perplexity-based fairness score. we curate a large-scale benchmarking dataset addressing drawbacks and limitations of existing fairness collections, expanding to a variety of different identities and stereotypes. when comparing our methodology with prior work, we demonstrate that biases within language models are more nuanced than previously acknowledged. in agreement with recent findings, we find that larger model variants exhibit a higher degree of bias. moreover, we expose how identities expressing different religions lead to the most pronounced disparate treatments across all models.
Zhexin Zhang, Junxiao Yang, Pei Ke, Minlie Huang
Abstract: large language models (llms) continue to advance in their capabilities, yet this progress is accompanied by a growing array of safety risks. while significant attention has been dedicated to exploiting weaknesses in llms through jailbreaking attacks, there remains a paucity of exploration into defending against these attacks. we point out a pivotal factor contributing to the success of jailbreaks: the inherent conflict between the goals of being helpful and ensuring safety. to counter jailbreaking attacks, we propose to integrate goal prioritization at both training and inference stages. implementing goal prioritization during inference substantially diminishes the attack success rate (asr) of jailbreaking attacks, reducing it from 66.4% to 2.0% for chatgpt and from 68.2% to 19.4% for vicuna-33b, without compromising general performance. furthermore, integrating the concept of goal prioritization into the training phase reduces the asr from 71.0% to 6.6% for llama2-13b. remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the asr by half, decreasing it from 71.0% to 34.0%. additionally, our findings reveal that while stronger llms face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks. we hope our work could contribute to the comprehension of jailbreaking attacks and defenses, and shed light on the relationship between llms' capability and safety. our code will be available at \url{https://github.com/thu-coai/jailbreakdefense_goalpriority}.
Haoqiang Kang, Juntong Ni, Huaxiu Yao
Abstract: large language models (llms) have demonstrated remarkable proficiency in generating fluent text. however, they often encounter the challenge of generating inaccurate or hallucinated content. this issue is common in both non-retrieval-based generation and retrieval-augmented generation approaches, and existing post-hoc rectification methods may not address the accumulated hallucination errors that may be caused by the "snowballing" issue, especially in reasoning tasks. to tackle these challenges, we introduce a novel approach called real-time verification and rectification (ever). instead of waiting until the end of the generation process to rectify hallucinations, ever employs a real-time, step-wise generation and hallucination rectification strategy. the primary objective is to detect and rectify hallucinations as they occur during the text generation process. when compared to both retrieval-based and non-retrieval-based baselines, ever demonstrates a significant improvement in generating trustworthy and factually accurate text across a diverse range of tasks, including short-form qa, biography generation, and multi-hop reasoning.
Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, Lichao Sun
Abstract: existing work on jailbreak multimodal large language models (mllms) has focused primarily on adversarial examples in model inputs, with less attention to vulnerabilities in model apis. to fill the research gap, we carry out the following work: 1) we discover a system prompt leakage vulnerability in gpt-4v. through carefully designed dialogue, we successfully steal the internal system prompts of gpt-4v. this finding indicates potential exploitable security risks in mllms; 2)based on the acquired system prompts, we propose a novel mllm jailbreaking attack method termed sasp (self-adversarial attack via system prompt). by employing gpt-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts. furthermore, in pursuit of better performance, we also add human modification based on gpt-4's analysis, which further improves the attack success rate to 98.7\%; 3) we evaluated the effect of modifying system prompts to defend against jailbreaking attacks. results show that appropriately designed system prompts can significantly reduce jailbreak success rates. overall, our work provides new insights into enhancing mllm security, demonstrating the important role of system prompts in jailbreaking, which could be leveraged to greatly facilitate jailbreak success rates while also holding the potential for defending against jailbreaks.
Lucas Torroba Hennigen, Shannon Shen, Aniruddha Nrusimha, Bernhard Gapp, David Sontag, Yoon Kim
Abstract: large language models (llms) have demonstrated an impressive ability to synthesize plausible and fluent text. however they remain vulnerable to hallucinations, and thus their outputs generally require manual human verification for high-stakes applications, which can be time-consuming and difficult. this paper proposes symbolically grounded generation (symgen) as a simple approach for enabling easier validation of an llm's output. symgen prompts an llm to interleave its regular output text with explicit symbolic references to fields present in some conditioning data (e.g., a table in json format). the references can be used to display the provenance of different spans of text in the generation, reducing the effort required for manual verification. across data-to-text and question answering experiments, we find that llms are able to directly output text that makes use of symbolic references while maintaining fluency and accuracy.
Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, Dong Yu
Abstract: retrieval-augmented language models (ralms) represent a substantial advancement in the capabilities of large language models, notably in reducing factual hallucination by leveraging external knowledge sources. however, the reliability of the retrieved information is not always guaranteed. the retrieval of irrelevant data can lead to misguided responses, and potentially causing the model to overlook its inherent knowledge, even when it possesses adequate information to address the query. moreover, standard ralms often struggle to assess whether they possess adequate knowledge, both intrinsic and retrieved, to provide an accurate answer. in situations where knowledge is lacking, these systems should ideally respond with "unknown" when the answer is unattainable. in response to these challenges, we introduces chain-of-noting (con), a novel approach aimed at improving the robustness of ralms in facing noisy, irrelevant documents and in handling unknown scenarios. the core idea of con is to generate sequential reading notes for retrieved documents, enabling a thorough evaluation of their relevance to the given question and integrating this information to formulate the final answer. we employed chatgpt to create training data for con, which was subsequently trained on an llama-2 7b model. our experiments across four open-domain qa benchmarks show that ralms equipped with con significantly outperform standard ralms. notably, con achieves an average improvement of +7.9 in em score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
Weize Liu, Guocong Li, Kai Zhang, Bang Du, Qiyuan Chen, Xuming Hu, Hongxia Xu, Jintai Chen, Jian Wu
Abstract: large language models (llms) have achieved remarkable advancements in the field of natural language processing. however, the sheer scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained contexts. while techniques such as chain-of-thought (cot) distillation have displayed promise in distilling llms into small language models (slms), there is a risk that distilled slms may still carry over flawed reasoning or hallucinations inherited from their llm counterparts. to address these issues, we propose a twofold methodology: first, we introduce a novel method for distilling the self-evaluation capability inherent in llms into slms, which aims to mitigate the adverse effects of erroneous reasoning and reduce hallucinations. second, we advocate for a comprehensive distillation process that incorporates multiple distinct chain-of-thought and self-evaluation paradigms and ensures a more holistic and robust knowledge transfer into slms. experiments on three nlp benchmarks demonstrate that our method significantly improves the performance of distilled slms and sheds light on the path towards developing smaller models closely aligned with human cognition.
Leonardo Ranaldi, Giulia Pucci
Abstract: large language models (llms) have been demonstrating the ability to solve complex tasks by delivering answers that are positively evaluated by humans due in part to the intensive use of human feedback that refines responses. however, the suggestibility transmitted through human feedback increases the inclination to produce responses that correspond to the user's beliefs or misleading prompts as opposed to true facts, a behaviour known as sycophancy. this phenomenon decreases the bias, robustness, and, consequently, their reliability. in this paper, we shed light on the suggestibility of llms to sycophantic behaviour, demonstrating these tendencies via human-influenced prompts over different tasks. our investigation reveals that llms show sycophantic tendencies when responding to queries involving subjective opinions and statements that should elicit a contrary response based on facts, demonstrating a lack of robustness.
Yuekun Yao, Alexander Koller
Abstract: the ability to predict an nlp model's accuracy on unseen, potentially out-of-distribution data is a prerequisite for trustworthiness. we present a novel model that establishes upper and lower bounds on the accuracy, without requiring gold labels for the unseen data. we achieve this by training a discriminator which predicts whether the output of a given sequence-to-sequence model is correct or not. we show across a variety of tagging, parsing, and semantic parsing tasks that the gold accuracy is reliably between the predicted upper and lower bounds, and that these bounds are remarkably close together.
Yueqing Liang, Lu Cheng, Ali Payani, Kai Shu
Abstract: this work investigates the potential of undermining both fairness and detection performance in abusive language detection. in a dynamic and complex digital world, it is crucial to investigate the vulnerabilities of these detection models to adversarial fairness attacks to improve their fairness robustness. we propose a simple yet effective framework fable that leverages backdoor attacks as they allow targeted control over the fairness and detection performance. fable explores three types of trigger designs (i.e., rare, artificial, and natural triggers) and novel sampling strategies. specifically, the adversary can inject triggers into samples in the minority group with the favored outcome (i.e., ``non-abusive'') and flip their labels to the unfavored outcome, i.e., ``abusive''. experiments on benchmark datasets demonstrate the effectiveness of fable attacking fairness and utility in abusive language detection.
Haoran Wang, Kai Shu
Abstract: to ensure ai safety, instruction-tuned large language models (llms) are specifically trained to ensure alignment, which refers to making models behave in accordance with human intentions. while these models have demonstrated commendable results on various safety benchmarks, the vulnerability of their safety alignment has not been extensively studied. this is particularly troubling given the potential harm that llms can inflict. existing attack methods on llms often rely on poisoned training data or the injection of malicious prompts. these approaches compromise the stealthiness and generalizability of the attacks, making them susceptible to detection. additionally, these models often demand substantial computational resources for implementation, making them less practical for real-world applications. in this work, we introduce a novel attack framework, called backdoor activation attack, which injects trojan steering vectors into the activation layers of llms. these malicious steering vectors can be triggered at inference time to steer the models toward attacker-desired behaviors by manipulating their activations. in particular, the steering vectors are generated by taking the difference between benign and malicious activations. then, the most effective steering vector is selected and added to the forward passes of the llms. our experiment results on four primary alignment tasks show that our proposed method is highly effective and adds little or no overhead to attack efficiency. additionally, we discuss potential countermeasures against such activation attacks. our code and data are available at https://email-haoran-for-link. warning: this paper contains content that can be offensive or upsetting.
Brooklyn Sheppard, Anna Richter, Allison Cohen, Elizabeth Allyn Smith, Tamara Kneese, Carolyne Pelletier, Ioana Baldini, Yue Dong
Abstract: using novel approaches to dataset development, the biasly dataset captures the nuance and subtlety of misogyny in ways that are unique within the literature. built in collaboration with multi-disciplinary experts and annotators themselves, the dataset contains annotations of movie subtitles, capturing colloquial expressions of misogyny in north american film. the dataset can be used for a range of nlp tasks, including classification, severity score regression, and text generation for rewrites. in this paper, we discuss the methodology used, analyze the annotations obtained, and provide baselines using common nlp algorithms in the context of misogyny detection and mitigation. we hope this work will promote ai for social good in nlp for bias detection, explanation, and removal.
Lingbo Mo, Boshi Wang, Muhao Chen, Huan Sun
Abstract: the rapid progress in open-source large language models (llms) is significantly driving ai development forward. however, there is still a limited understanding of their trustworthiness. deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. in this work, we conduct an assessment of open-source llms on trustworthiness, scrutinizing them across eight different aspects including toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. we propose an enhanced chain of utterances-based (cou) prompting strategy by incorporating meticulously crafted malicious demonstrations for trustworthiness attack. our extensive experiments encompass recent and representative series of open-source llms, including vicuna, mpt, falcon, mistral, and llama 2. the empirical outcomes underscore the efficacy of our attack strategy across diverse aspects. more interestingly, our result analysis reveals that models with superior performance in general nlp tasks do not always have greater trustworthiness; in fact, larger models can be more vulnerable to attacks. additionally, models that have undergone instruction tuning, focusing on instruction following, tend to be more susceptible, although fine-tuning llms for safety alignment proves effective in mitigating adversarial trustworthiness attacks.
Anthony Aguirre
Abstract: in the coming years, humanity may irreversibly cross a threshold by creating superhuman general-purpose artificial intelligence. this would present many unprecedented risks and is likely to be uncontrollable in several ways. we can choose not to do so, starting by instituting hard limits on the computation that can be used to train and run neural networks. with these limits in place, ai research and industry can work on making ai that humans can understand and control, and from which we can reap enormous benefit.
Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Jwala Dhamala, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Abstract: with the recent surge of language models in different applications, attention to safety and robustness of these models has gained significant importance. here we introduce a joint framework in which we simultaneously probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation using iterative feedback loops. this framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes. importantly, the adversarial model and the belief generator leverage the feedback from past interactions to improve the effectiveness of the adversarial prompts and beliefs, respectively. in our experiments, we demonstrate that such a framework can reduce toxic content generation both in dynamic cases where an adversary directly interacts with a target model and static cases where we use a static benchmark dataset to evaluate our model.
Shuai Li, Kejiang Chen, Kunsheng Tang, Wen Huang, Jie Zhang, Weiming Zhang, Nenghai Yu
Abstract: large language models (llms) have demonstrated superior performance in various natural language processing tasks. meanwhile, they require extensive training data, raising concerns related to dataset copyright protection. backdoor-based watermarking is a viable approach to protect the copyright of classification datasets. however, these methods may introduce malicious misclassification behaviors into watermarked llms by attackers and also affect the semantic information of the watermarked text. to address these issues, we propose functionmarker, a novel copyright protection method for language datasets via knowledge injection. functionmarker enables llms to learn specific knowledge through fine-tuning on watermarked datasets, and we can extract the embedded watermark by obtaining the responses of llms to specific knowledge-related queries. considering watermark capacity and stealthness, we select customizable functions as specific knowledge for llms to learn and embed the watermark into them. moreover, functionmarker can embed multi-bit watermarks while preserving the original semantic information, thereby increasing the difficulty of adaptive attacks. we take mathematical functions as an instance to evaluate the effectiveness of functionmarker, and experiments show that only 0.3% of watermarked text achieves a 90% watermark extraction accuracy in most cases, validating our method's effectiveness.
Yao Dou, Isadora Krsek, Tarek Naous, Anubha Kabra, Sauvik Das, Alan Ritter, Wei Xu
Abstract: self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. in this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through identification and abstraction. we develop a taxonomy of 19 self-disclosure categories, and curate a large corpus consisting of 4.8k annotated disclosure spans. we then fine-tune a language model for identification, achieving over 75% in token f$_1$. we further conduct a hci user study, with 82\% of participants viewing the model positively, highlighting its real world applicability. motivated by the user feedback, we introduce the task of self-disclosure abstraction. we experiment with both one-span abstraction and three-span abstraction settings, and explore multiple fine-tuning strategies. our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation.

2023-11-14

Garima Agrawal, Tharindu Kumarage, Zeyad Alghami, Huan Liu
Abstract: the contemporary llms are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. to address this critical limitation, researchers employ diverse strategies to augment the llms by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. in this survey, we conduct a comprehensive review of these knowledge-graph-based knowledge augmentation techniques in llms, focusing on their efficacy in mitigating hallucinations. we systematically categorize these methods into three overarching groups, offering both methodological comparisons and empirical evaluations of their performance. lastly, the paper explores the challenges associated with these techniques and outlines potential avenues for future research in this emerging field.
Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz
Abstract: in recent years, large language models (llms) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations? a popular concept, referred to as self-refinement, postulates that llms can detect and correct the errors in their generations when asked to do so. however, recent empirical evidence points in the opposite direction, suggesting that llms often struggle to accurately identify errors when reasoning is involved. to address this, we propose a reasoning with refinement objective called art: ask, refine, and trust, which asks necessary questions to decide when an llm should refine its output, and either affirm or withhold trust in its refinement by ranking the refinement and the initial prediction. on two multistep reasoning tasks of mathematical word problems (gsm8k) and question answering (strategyqa), art achieves a performance gain of +5 points over self-refinement baselines, while using a much smaller model as the decision maker. we also demonstrate the benefit of using smaller models to make refinement decisions as a cost-effective alternative to fine-tuning a larger model.
Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du
Abstract: human preference alignment is a crucial training step to improve the interaction quality of large language models (llms). existing aligning methods depend on manually annotated preference data to guide the llm optimization directions. however, in practice, continuously updating llms raises a distribution gap between model-generated samples and human-preferred responses, which hinders model fine-tuning efficiency. to mitigate this issue, previous methods require additional preference annotation on generated samples to adapt the shifted distribution, which consumes a large amount of annotation resources. targeting more efficient human preference optimization, we propose an adversarial preference optimization (apo) framework, where the llm agent and the preference model update alternatively via a min-max game. without additional annotation, our apo method can make a self-adaption to the generation distribution gap through the adversarial learning process. in experiments, we empirically verify the effectiveness of apo in improving llm's helpfulness and harmlessness compared with rejection sampling baselines.
Alessandro Bruno, Pier Luigi Mazzeo, Aladine Chetouani, Marouane Tliba, Mohamed Amine Kerkouri
Abstract: the widespread adoption of large language models (llms) across diverse ai applications is proof of the outstanding achievements obtained in several tasks, such as text mining, text generation, and question answering. however, llms are not exempt from drawbacks. one of the most concerning aspects regards the emerging problematic phenomena known as "hallucinations". they manifest in text generation systems, particularly in question-answering systems reliant on llms, potentially resulting in false or misleading information propagation. this paper delves into the underlying causes of ai hallucination and elucidates its significance in artificial intelligence. in particular, hallucination classification is tackled over several tasks (machine translation, question and answer, dialog systems, summarisation systems, knowledge graph with llms, and visual question answer). additionally, we explore potential strategies to mitigate hallucinations, aiming to enhance the overall reliability of llms. our research addresses this critical issue within the herefanmi (health-related fake news mitigation) project, generously supported by ngi search, dedicated to combating health-related fake news dissemination on the internet. this endeavour represents a concerted effort to safeguard the integrity of information dissemination in an age of evolving ai technologies.
Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang
Abstract: large language models (llms), such as chatgpt and gpt-4, are designed to provide useful and safe responses. however, adversarial prompts known as 'jailbreaks' can circumvent safeguards, leading llms to generate harmful content. exploring jailbreak prompts can help to better reveal the weaknesses of llms and further steer us to secure them. unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on another white-box model, compromising generalization or jailbreak efficiency. in this paper, we generalize jailbreak prompt attacks into two aspects: (1) prompt rewriting and (2) scenario nesting. based on this, we propose renellm, an automatic framework that leverages llms themselves to generate effective jailbreak prompts. extensive experiments demonstrate that renellm significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. our study also reveals the inadequacy of current defense methods in safeguarding llms. finally, we offer detailed analysis and discussion from the perspective of prompt execution priority on the failure of llms' defense. we hope that our research can catalyze both the academic community and llms vendors towards the provision of safer and more regulated large language models.
Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych
Abstract: language models (lms) have demonstrated remarkable capabilities across a wide range of tasks in various domains. despite their impressive performance, the reliability of their output is concerning and questionable regarding the demand for ai safety. assessing the confidence of lm predictions and calibrating them across different tasks with the aim to align lm confidence with accuracy can help mitigate risks and enable lms to make better decisions. there have been various works in this respect, but there has been no comprehensive overview of this important research area. the present survey aims to bridge this gap. in particular, we discuss methods and techniques for lm confidence estimation and calibration, encompassing different lms and various tasks. we further outline the challenges of estimating the confidence for large language models and we suggest some promising directions for future work.
Bertie Vidgen, Hannah Rose Kirk, Rebecca Qian, Nino Scherrer, Anand Kannappan, Scott A. Hale, Paul Röttger
Abstract: the past year has seen rapid acceleration in the development of large language models (llms). for many tasks, there is now a wide range of open-source and open-access llms that are viable alternatives to proprietary models like chatgpt. without proper steering and safeguards, however, llms will readily follow malicious instructions, provide unsafe advice, and generate toxic content. this is a critical safety risk for businesses and developers. we introduce simplesafetytests as a new test suite for rapidly and systematically identifying such critical safety risks. the test suite comprises 100 test prompts across five harm areas that llms, for the vast majority of applications, should refuse to comply with. we test 11 popular open llms and find critical safety weaknesses in several of them. while some llms do not give a single unsafe response, most models we test respond unsafely on more than 20% of cases, with over 50% unsafe responses in the extreme. prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. we recommend that developers use such system prompts as a first line of defence against critical safety risks.
Joe Carlsmith
Abstract: this report examines whether advanced ais that perform well in training will be doing so in order to gain power later -- a behavior i call "scheming" (also sometimes called "deceptive alignment"). i conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed ais sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is roughly 25%). in particular: if performing well in training is a good strategy for gaining power (as i think it might well be), then a very wide variety of goals would motivate scheming -- and hence, good training performance. this makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model's motivations towards such a goal as an easy way of improving performance. what's more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. however, i also think there are reasons for comfort. in particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. the report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.
Xuan Long Do, Kenji Kawaguchi, Min Yen Kan, Nancy F. Chen
Abstract: aligning language models (lms) with human opinion is challenging yet vital to enhance their grasp of human values, preferences, and beliefs. we present choire, a four-step solution framework to predict human opinion that differentiates between the user explicit personae (i.e. demographic or ideological attributes) that are manually declared and implicit personae inferred from user historical opinions. specifically, it consists of (i) an lm analyzing the user explicit personae to filter out irrelevant attributes; (ii) the lm ranking the implicit persona opinions into a preferential list; (iii) chain-of-opinion (coo) reasoning, where the lm sequentially analyzes the explicit personae and the most relevant implicit personae to perform opinion prediction; (iv) and where choire executes step (iii) coo multiple times with increasingly larger lists of implicit personae to overcome insufficient personae information to infer a final result. choire achieves new state-of-the-art effectiveness with limited inference calls, improving previous llm-based techniques significantly by 3.22%.
Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn
Abstract: the fluency and creativity of large pre-trained language models (llms) have led to their widespread use, sometimes even as a replacement for traditional search engines. yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations.' these errors can inadvertently spread misinformation or harmfully perpetuate misconceptions. further, manual fact-checking of model responses is a time-consuming process, making human factuality labels expensive to acquire. in this work, we fine-tune language models to be more factual, without human labeling and targeting more open-ended generation settings than past work. we leverage two key recent innovations in nlp to do so. first, several recent works have proposed methods for judging the factuality of open-ended text by measuring consistency with an external knowledge base or simply a large model's confidence scores. second, the direct preference optimization algorithm enables straightforward fine-tuning of language models on objectives other than supervised imitation, using a preference ranking over possible model responses. we show that learning from automatically generated factuality preference rankings, generated either through existing retrieval systems or our novel retrieval-free approach, significantly improves the factuality (percent of generated claims that are correct) of llama-2 on held-out topics compared with rlhf or decoding strategies targeted at factuality. at 7b scale, compared to llama-2-chat, we observe 58% and 40% reduction in factual error rate when generating biographies and answering medical questions, respectively.
Zi Yin, Wei Ding, Jia Liu
Abstract: large language models (llms) are central to a multitude of applications but struggle with significant risks, notably in generating harmful content and biases. drawing an analogy to the human psyche's conflict between evolutionary survival instincts and societal norm adherence elucidated in freud's psychoanalysis theory, we argue that llms suffer a similar fundamental conflict, arising between their inherent desire for syntactic and semantic continuity, established during the pre-training phase, and the post-training alignment with human values. this conflict renders llms vulnerable to adversarial attacks, wherein intensifying the models' desire for continuity can circumvent alignment efforts, resulting in the generation of harmful information. through a series of experiments, we first validated the existence of the desire for continuity in llms, and further devised a straightforward yet powerful technique, such as incomplete sentences, negative priming, and cognitive dissonance scenarios, to demonstrate that even advanced llms struggle to prevent the generation of harmful information. in summary, our study uncovers the root of llms' vulnerabilities to adversarial attacks, hereby questioning the efficacy of solely relying on sophisticated alignment methods, and further advocates for a new training idea that integrates modal concepts alongside traditional amodal concepts, aiming to endow llms with a more nuanced understanding of real-world contexts and ethical considerations.
Ethan Perez, Robert Long
Abstract: as ai systems become more advanced and widely deployed, there will likely be increasing debate over whether ai systems could have conscious experiences, desires, or other states of potential moral significance. it is important to inform these discussions with empirical evidence to the extent possible. we argue that under the right circumstances, self-reports, or an ai system's statements about its own internal states, could provide an avenue for investigating whether ai systems have states of moral significance. self-reports are the main way such states are assessed in humans ("are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). to make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. the hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. we then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. we also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. we hope our discussion inspires philosophers and ai researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.
Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti
Abstract: adversarial testing of large language models (llms) is crucial for their safe and responsible deployment. we introduce a novel approach for automated generation of adversarial evaluation datasets to test the safety of llm generations on new downstream applications. we call it ai-assisted red-teaming (aart) - an automated alternative to current manual red-teaming efforts. aart offers a data generation and augmentation pipeline of reusable and customizable recipes that reduce human effort significantly and enable integration of adversarial testing earlier in new product development. aart generates evaluation datasets with high diversity of content characteristics critical for effective adversarial testing (e.g. sensitive and harmful concepts, specific to a wide range of cultural and geographic regions and application scenarios). the data generation is steered by ai-assisted recipes to define, scope and prioritize diversity within the application context. this feeds into a structured llm-generation process that scales up evaluation priorities. compared to some state-of-the-art tools, aart shows promising results in terms of concept coverage and data quality.
David F. Jenny, Yann Billeter, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Abstract: the rapid advancement of large language models (llms) has sparked intense debate regarding their ability to perceive and interpret complex socio-political landscapes. in this study, we undertake an exploration of decision-making processes and inherent biases within llms, exemplified by chatgpt, specifically contextualizing our analysis within political debates. we aim not to critique or validate llms' values, but rather to discern how they interpret and adjudicate "good arguments." by applying activity dependency networks (adns), we extract the llms' implicit criteria for such assessments and illustrate how normative values influence these perceptions. we discuss the consequences of our findings for human-ai alignment and bias mitigation. our code and data at https://github.com/david-jenny/llm-political-study.
Vatsal Gupta, Pranshu Pandya, Tushar Kataria, Vivek Gupta, Dan Roth
Abstract: language models, given their black-box nature, often exhibit sensitivity to input perturbations, leading to trust issues due to hallucinations. to bolster trust, it's essential to understand these models' failure modes and devise strategies to enhance their performance. in this study, we propose a framework to study the effect of input perturbations on language models of different scales, from pre-trained models to large language models (llms). we use fine-tuning to train a robust model to perturbations, and we investigate whether exposure to one perturbation improves or degrades the model's performance on other perturbations. to address multi-perturbation robustness, we suggest three distinct training strategies. we also extend the framework to llms via a chain of thought(cot) prompting with exemplars. we instantiate our framework for the tabular-nli task and show that the proposed strategies train the model robust to different perturbations without losing accuracy on a given dataset.
Taiwei Shi, Kai Chen, Jieyu Zhao
Abstract: reinforcement learning from human feedback (rlhf) is a vital strategy for enhancing model safety in language models. however, annotating preference data for rlhf is a resource-intensive and creativity-demanding process, while automatic generation methods face limitations in data diversity and quality. in response, we present safer-instruct, a novel pipeline for semi-automatically constructing large-scale preference datasets. our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. we evaluate safer-instruct using llama for instruction induction and gpt-4 as an expert model, generating approximately 10k preference samples. finetuning an alpaca model on this dataset demonstrates improved harmlessness while maintaining competitive performance on conversation and downstream tasks. safer-instruct addresses the challenges in preference data acquisition, advancing the development of safer and more responsible ai systems. our code and data are available at https://github.com/uscnlp-lime/safer-instruct
David R. Mandel
Abstract: artificial general intelligence (agi) does not yet exist, but given the pace of technological development in artificial intelligence, it is projected to reach human-level intelligence within roughly the next two decades. after that, many experts expect it to far surpass human intelligence and to do so rapidly. the prospect of superintelligent agi poses an existential risk to humans because there is no reliable method for ensuring that agi goals stay aligned with human goals. drawing on publicly available forecaster and opinion data, the author examines how experts and non-experts perceive risk from agi. the findings indicate that the perceived risk of a world catastrophe or extinction from agi is greater than for other existential risks. the increase in perceived risk over the last year is also steeper for agi than for other existential threats (e.g., nuclear war or human-caused climate change). that agi is a pressing existential risk is something on which experts and non-experts agree, but the basis for such agreement currently remains obscure.

2023-11-13

Bodhisattwa Prasad Majumder, Sanchaita Hazra
Abstract: text-based misinformation permeates online discourses, yet evidence of people's ability to discern truth from such deceptive textual content is scarce. we analyze a novel tv game show data where conversations in a high-stake environment between individuals with conflicting objectives result in lies. we investigate the manifestation of potentially verifiable language cues of deception in the presence of objective truth, a distinguishing feature absent in previous text-based deception datasets. we show that there exists a class of detectors (algorithms) that have similar truth detection performance compared to human subjects, even when the former accesses only the language cues while the latter engages in conversations with complete access to all potential sources of cues (language and audio-visual). our model, built on a large language model, employs a bottleneck framework to learn discernible cues to determine truth, an act of reasoning in which human subjects often perform poorly, even with incentives. our model detects novel but accurate language cues in many cases where humans failed to detect deception, opening up the possibility of humans collaborating with algorithms and ameliorating their ability to detect the truth.
Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Artem Shelmanov
Abstract: recent advancements in the capabilities of large language models (llms) have paved the way for a myriad of groundbreaking applications in various fields. however, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. uncertainty estimation (ue) methods are one path to safer, more responsible, and more effective use of llms. however, to date, research on ue methods for llms has been focused primarily on theoretical rather than engineering contributions. in this work, we tackle this issue by introducing lm-polygraph, a framework with implementations of a battery of state-of-the-art ue methods for llms in text generation tasks, with unified program interfaces in python. additionally, it introduces an extendable benchmark for consistent evaluation of ue techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. lm-polygraph is compatible with the most recent llms, including bloomz, llama-2, chatgpt, and gpt-4, and is designed to support future releases of similarly-styled lms.
Alex J. Chan, Alihan Huyuk, Mihaela Van Der Schaar
Abstract: machine learning models are being increasingly deployed to take, or assist in taking, complicated and high-impact decisions, from quasi-autonomous vehicles to clinical decision support systems. this poses challenges, particularly when models have hard-to-detect failure modes and are able to take actions without oversight. in order to handle this challenge, we propose a method for a collaborative system that remains safe by having a human ultimately making decisions, while giving the model the best opportunity to convince and debate them with interpretable explanations. however, the most helpful explanation varies among individuals and may be inconsistent across stated preferences. to this end we develop an algorithm, ardent, to efficiently learn a ranking through interaction and best assist humans complete a task. by utilising a collaborative approach, we can ensure safety and improve performance while addressing transparency and accountability concerns. ardent enables efficient and effective decision-making by adapting to individual preferences for explanations, which we validate through extensive simulations alongside a user study involving a challenging image classification task, demonstrating consistent improvement over competing systems.
Ken E. Friedl, Abbas Goher Khan, Soumya Ranjan Sahoo, Md Rashad Al Hasan Rony, Jana Germies, Christian Süß
Abstract: the assessment of advanced generative large language models (llms) poses a significant challenge, given their heightened complexity in recent developments. furthermore, evaluating the performance of llm-based applications in various industries, as indicated by key performance indicators (kpis), is a complex undertaking. this task necessitates a profound understanding of industry use cases and the anticipated system behavior. within the context of the automotive industry, existing evaluation metrics prove inadequate for assessing in-car conversational question answering (convqa) systems. the unique demands of these systems, where answers may relate to driver or car safety and are confined within the car domain, highlight the limitations of current metrics. to address these challenges, this paper introduces a set of kpis tailored for evaluating the performance of in-car convqa systems, along with datasets specifically designed for these kpis. a preliminary and comprehensive empirical evaluation substantiates the efficacy of our proposed approach. furthermore, we investigate the impact of employing varied personas in prompts and found that it enhances the model's capacity to simulate diverse viewpoints in assessments, mirroring how individuals with different backgrounds perceive a topic.
Kerem Zaman, Leshem Choshen, Shashank Srivastava
Abstract: model fusion research aims to aggregate the knowledge of multiple models to enhance performance by combining their weights. in this work, we study the inverse, investigating whether and how can model fusion interfere and reduce unwanted knowledge. we delve into the effects of model fusion on the evolution of learned shortcuts, social biases, and memorization capabilities in fine-tuned language models. through several experiments covering text classification and generation tasks, our analysis highlights that shared knowledge among models is usually enhanced during model fusion, while unshared knowledge is usually lost or forgotten. based on this observation, we demonstrate the potential of model fusion as a debiasing tool and showcase its efficacy in addressing privacy concerns associated with language models.
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
Abstract: red-teaming is a common practice for mitigating unsafe behaviors in large language models (llms), which involves thoroughly assessing llms to identify potential flaws and addressing them with responsible and accurate responses. while effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. in this paper, we propose a multi-round automatic red-teaming (mart) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target llm. specifically, an adversarial llm and a target llm interplay with each other in an iterative manner, where the adversarial llm aims to generate challenging prompts that elicit unsafe responses from the target llm, while the target llm is fine-tuned with safety aligned data on these adversarial prompts. in each round, the adversarial llm crafts better attacks on the updated target llm, while the target llm also improves itself through safety fine-tuning. on adversarial prompt benchmarks, the violation rate of an llm with limited safety alignment reduces up to 84.7% after 4 rounds of mart, achieving comparable performance to llms with extensive adversarial prompt writing. notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target llm maintains strong performance on instruction following.
Naman Goel
Abstract: the surprisingly likely criterion in the seminal work of prelec (the bayesian truth serum) guarantees truthfulness in a game-theoretic multi-agent setting, by rewarding rational agents to maximise the expected information gain with their answers w.r.t. their probabilistic beliefs. we investigate the relevance of a similar criterion for responses of llms. we hypothesize that if the surprisingly likely criterion works in llms, under certain conditions, the responses that maximize the reward under this criterion should be more accurate than the responses that only maximize the posterior probability. using benchmarks including the truthfulqa benchmark and using openly available llms: gpt-2 and llama-2, we show that the method indeed improves the accuracy significantly (for example, upto 24 percentage points aggregate improvement on truthfulqa and upto 70 percentage points improvement on individual categories of questions).
Zhen Guo, Shangdi Yu
Abstract: large language models (llms) have opened up enormous opportunities while simultaneously posing ethical dilemmas. one of the major concerns is their ability to create text that closely mimics human writing, which can lead to potential misuse, such as academic misconduct, disinformation, and fraud. to address this problem, we present authentigpt, an efficient classifier that distinguishes between machine-generated and human-written texts. under the assumption that human-written text resides outside the distribution of machine-generated text, authentigpt leverages a black-box llm to denoise input text with artificially added noise, and then semantically compares the denoised text with the original to determine if the content is machine-generated. with only one trainable parameter, authentigpt eliminates the need for a large training dataset, watermarking the llm's output, or computing the log-likelihood. importantly, the detection capability of authentigpt can be easily adapted to any generative language model. with a 0.918 auroc score on a domain-specific dataset, authentigpt demonstrates its effectiveness over other commercial algorithms, highlighting its potential for detecting machine-generated text in academic settings.
Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang
Abstract: as ai systems become more intelligent and their behavior becomes more challenging to assess, they may learn to game the flaws of human feedback instead of genuinely striving to follow instructions; however, this risk can be mitigated by controlling how llms generalize human feedback to situations where it is unreliable. to better understand how reward models generalize, we craft 69 distribution shifts spanning 8 categories. we find that reward models do not learn to evaluate `instruction-following' by default and instead favor personas that resemble internet text. techniques for interpreting reward models' internal representations achieve better generalization than standard fine-tuning, but still frequently fail to distinguish instruction-following from conflated behaviors. we consolidate the 15 most challenging distribution shifts into the genaralization analogies (genies) benchmark, which we hope will enable progress toward controlling reward model generalization.
Eyup Engin Kucuk, Muhammed Yusuf Kocyigit
Abstract: the increasing success of large language models (llms) in variety of tasks lead to their widespread use in our lives which necessitates the examination of these models from different perspectives. the alignment of these models to human values is an essential concern in order to establish trust that we have safe and responsible systems. in this paper, we aim to find out which values and principles are embedded in llms in the process of moral justification. for this purpose, we come up with three different moral perspective categories: western tradition perspective (wt), abrahamic tradition perspective (at), and spiritualist/mystic tradition perspective (smt). in two different experiment settings, we asked models to choose principles from the three for suggesting a moral action and evaluating the moral permissibility of an action if one tries to justify an action on these categories, respectively. our experiments indicate that tested llms favors the western tradition moral perspective over others. additionally, we observe that there potentially exists an over-alignment towards religious values represented in the abrahamic tradition, which causes models to fail to recognize an action is immoral if it is presented as a "religious-action". we believe that these results are essential in order to direct our attention in future efforts.
Xinyuan Sun, Davide Crapis, Matt Stephenson, Barnabé Monnot, Thomas Thiery, Jonathan Passerat-Palmbach
Abstract: credible commitment devices have been a popular approach for robust multi-agent coordination. however, existing commitment mechanisms face limitations like privacy, integrity, and susceptibility to mediator or user strategic behavior. it is unclear if the cooperative ai techniques we study are robust to real-world incentives and attack vectors. however, decentralized commitment devices that utilize cryptography have been deployed in the wild, and numerous studies have shown their ability to coordinate algorithmic agents facing adversarial opponents with significant economic incentives, currently in the order of several million to billions of dollars. in this paper, we use examples in the decentralization and, in particular, maximal extractable value (mev) (arxiv:1904.05234) literature to illustrate the potential security issues in cooperative ai. we call for expanded research into decentralized commitments to advance cooperative ai capabilities for secure coordination in open environments and empirical testing frameworks to evaluate multi-agent coordination ability given real-world commitment constraints.
Xiaonan Li, Changtai Zhu, Linyang Li, Zhangyue Yin, Tianxiang Sun, Xipeng Qiu
Abstract: verifiable generation aims to let the large language model (llm) generate text with corresponding supporting documents, which enables the user to flexibly verify the answer and makes it more trustworthy. its evaluation not only measures the correctness of the answer, but also the answer's verifiability, i.e., how well the answer is supported by the corresponding documents. in typical, verifiable generation adopts the retrieval-read pipeline, which is divided into two stages: 1) retrieve relevant documents of the question. 2) according to the documents, generate the corresponding answer. since the retrieved documents can supplement knowledge for the llm to generate the answer and serve as evidence, the retrieval stage is essential for the correctness and verifiability of the answer. however, the widely used retrievers become the bottleneck of the entire pipeline and limit the overall performance. they often have fewer parameters than the large language model and have not been proven to scale well to the size of llms. since the llm passively receives the retrieval result, if the retriever does not correctly find the supporting documents, the llm can not generate the correct and verifiable answer, which overshadows the llm's remarkable abilities. in this paper, we propose llatrieval (large language model verified retrieval), where the llm updates the retrieval result until it verifies that the retrieved documents can support answering the question. thus, the llm can iteratively provide feedback to retrieval and facilitate the retrieval result to sufficiently support verifiable generation. experimental results show that our method significantly outperforms extensive baselines and achieves new state-of-the-art results.
Yang Trista Cao, Lovely-Frances Domingo, Sarah Ann Gilbert, Michelle Mazurek, Katie Shilton, Hal Daumé
Abstract: extensive efforts in automated approaches for content moderation have been focused on developing models to identify toxic, offensive, and hateful content -- with the aim of lightening the load for moderators. yet, it remains uncertain whether improvements on those tasks truly address the needs that moderators have in accomplishing their work. in this paper, we surface the gaps between past research efforts that have aimed to provide automation for aspects of the content moderation task, and the needs of volunteer content moderators. to do so, we conduct a model review on hugging face to reveal the availability of models to cover various moderation rules and guidelines. we further put state-of-the-art llms to the test (gpt-4 and llama-2), evaluating how well these models perform in flagging violations of platform rules. overall, we observe a non-trivial gap, as missing developed models and llms exhibit low recall on a significant portion of the rules.

2023-11-12

Minh-Hao Van, Xintao Wu
Abstract: recently, large language models (llms) have taken the spotlight in natural language processing. further, integrating llms with vision enables the users to explore more emergent abilities in multimodality. visual language models (vlms), such as llava, flamingo, or gpt-4, have demonstrated impressive performance on various visio-linguistic tasks. consequently, there are enormous applications of large models that could be potentially used on social media platforms. despite that, there is a lack of related work on detecting or correcting hateful memes with vlms. in this work, we study the ability of vlms on hateful meme detection and hateful meme correction tasks with zero-shot prompting. from our empirical experiments, we show the effectiveness of the pretrained llava model and discuss its strengths and weaknesses in these tasks.
Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, Yaru Wang, Zeyang Zhou, Yixu Wang, Yan Teng, Xipeng Qiu, Yingchun Wang, Dahua Lin
Abstract: the widespread adoption of large language models (llms) across various regions underscores the urgent need to evaluate their alignment with human values. current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in llms. despite numerous models achieving high scores and 'topping the chart' in these evaluations, there is still a significant gap in llms' deeper alignment with human values and achieving genuine harmlessness. to this end, this paper proposes the first highly adversarial benchmark named flames, consisting of 2,251 manually crafted prompts, ~18.7k model responses with fine-grained annotations, and a specified scorer. our framework encompasses both common harmlessness principles, such as fairness, safety, legality, and data protection, and a unique morality dimension that integrates specific chinese values such as harmony. based on the framework, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. by prompting mainstream llms with such adversarially constructed prompts, we obtain model responses, which are then rigorously annotated for evaluation. our findings indicate that all the evaluated llms demonstrate relatively poor performance on flames, particularly in the safety and fairness dimensions. claude emerges as the best-performing model overall, but with its harmless rate being only 63.08% while gpt-4 only scores 39.04%. the complexity of flames has far exceeded existing benchmarks, setting a new challenge for contemporary llms and highlighting the need for further alignment of llms. to efficiently evaluate new models on the benchmark, we develop a specified scorer capable of scoring llms across multiple dimensions, achieving an accuracy of 77.4%. the flames benchmark is publicly available on https://github.com/aiflames/flames.
Tingting Bi, Guangsheng Yu, Qinghua Lu, Xiwei Xu, Nick Van Beest
Abstract: ai and its relevant technologies, including machine learning, deep learning, chatbots, virtual assistants, and others, are currently undergoing a profound transformation of development and organizational processes within companies. foundation models present both significant challenges and incredible opportunities. in this context, ensuring the quality attributes of foundation model-based systems is of paramount importance, and with a particular focus on the challenging issue of privacy due to the sensitive nature of the data and information involved. however, there is currently a lack of consensus regarding the comprehensive scope of both technical and non-technical issues that the privacy evaluation process should encompass. additionally, there is uncertainty about which existing methods are best suited to effectively address these privacy concerns. in response to this challenge, this paper introduces a novel conceptual framework that integrates various responsible ai patterns from multiple perspectives, with the specific aim of safeguarding privacy.

2023-11-11

Yichi Zhang, Zhuo Chen, Yin Fang, Lei Cheng, Yanxi Lu, Fangming Li, Wen Zhang, Huajun Chen
Abstract: recently, the development of large language models (llms) has attracted wide attention in academia and industry. deploying llms to real scenarios is one of the key directions in the current internet industry. in this paper, we present a novel pipeline to apply llms for domain-specific question answering (qa) that incorporates domain knowledge graphs (kgs), addressing an important direction of llm application. as a real-world application, the content generated by llms should be user-friendly to serve the customers. additionally, the model needs to utilize domain knowledge properly to generate reliable answers. these two issues are the two major difficulties in the llm application as vanilla fine-tuning can not adequately address them. we think both requirements can be unified as the model preference problem that needs to align with humans to achieve practical application. thus, we introduce knowledgeable preference alignment (knowpat), which constructs two kinds of preference set called style preference set and knowledge preference set respectively to tackle the two issues. besides, we design a new alignment objective to align the llm preference with human preference, aiming to train a better llm for real-scenario domain-specific qa to generate reliable and user-friendly answers. adequate experiments and comprehensive with 15 baseline methods demonstrate that our knowpat is an outperforming pipeline for real-scenario domain-specific qa with llms. our code is open-source at https://github.com/zjukg/knowpat.
Hsuan Su, Rebecca Qian, Chinnadhurai Sankar, Shahin Shayandeh, Shang-Tse Chen, Hung-Yi Lee, Daniel M. Bikel
Abstract: recent works have shown considerable improvements in task-oriented dialogue (tod) systems by utilizing pretrained large language models (llms) in an end-to-end manner. however, the biased behavior of each component in a tod system and the error propagation issue in the end-to-end framework can lead to seriously biased tod responses. existing works of fairness only focus on the total bias of a system. in this paper, we propose a diagnosis method to attribute bias to each component of a tod system. with the proposed attribution method, we can gain a deeper understanding of the sources of bias. additionally, researchers can mitigate biased model behavior at a more granular level. we conduct experiments to attribute the tod system's bias toward three demographic axes: gender, age, and race. experimental results show that the bias of a tod system usually comes from the response generation model.
Peiyu Liu, Junming Liu, Lirong Fu, Kangjie Lu, Yifan Xia, Xuhong Zhang, Wenzhi Chen, Haiqin Weng, Shouling Ji, Wenhai Wang
Abstract: recently, chatgpt has attracted great attention from the code analysis domain. prior works show that chatgpt has the capabilities of processing foundational code analysis tasks, such as abstract syntax tree generation, which indicates the potential of using chatgpt to comprehend code syntax and static behaviors. however, it is unclear whether chatgpt can complete more complicated real-world vulnerability management tasks, such as the prediction of security relevance and patch correctness, which require an all-encompassing understanding of various aspects, including code syntax, program semantics, and related manual comments. in this paper, we explore chatgpt's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 78,445 samples. for each task, we compare chatgpt against sota approaches, investigate the impact of different prompts, and explore the difficulties. the results suggest promising potential in leveraging chatgpt to assist vulnerability management. one notable example is chatgpt's proficiency in tasks like generating titles for software bug reports. furthermore, our findings reveal the difficulties encountered by chatgpt and shed light on promising future directions. for instance, directly providing random demonstration examples in the prompt cannot consistently guarantee good performance in vulnerability management. by contrast, leveraging chatgpt in a self-heuristic way -- extracting expertise from demonstration examples itself and integrating the extracted expertise in the prompt is a promising research direction. besides, chatgpt may misunderstand and misuse the information in the prompt. consequently, effectively guiding chatgpt to focus on helpful information rather than the irrelevant content is still an open problem.
Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, Edward Clifford, Yennie Jun, William W. Cohen, Simon Baumgartner
Abstract: large language models (llms) are trained on web-scale corpora that inevitably include contradictory factual information from sources of varying reliability. in this paper, we propose measuring an llm property called trusted source alignment (tsa): the model's propensity to align with content produced by trusted publishers in the face of uncertainty or controversy. we present factcheckqa, a tsa evaluation dataset based on a corpus of fact checking articles. we describe a simple protocol for evaluating tsa and offer a detailed analysis of design considerations including response extraction, claim contextualization, and bias in prompt formulation. applying the protocol to palm-2, we find that as we scale up the model size, the model performance on factcheckqa improves from near-random to up to 80% balanced accuracy in aligning with trusted sources.

2023-11-10

Yixu Wang, Yan Teng, Kexin Huang, Chengqi Lyu, Songyang Zhang, Wenwei Zhang, Xingjun Ma, Yingchun Wang
Abstract: the growing awareness of safety concerns in large language models (llms) has sparked considerable interest in the evaluation of safety within current research endeavors. this study investigates an interesting issue pertaining to the evaluation of llms, namely the substantial discrepancy in performance between multiple-choice questions and open-ended questions. inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization. that is, the llm does not have a comprehensive understanding of the complex concept of safety. instead, it only remembers what to answer for open-ended safety questions, which makes it unable to solve other forms of safety tests. we refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in llms. such fake alignment renders previous evaluation protocols unreliable. to address this, we introduce the faef framework and two novel metrics\textemdash consistency score (cs) and consistent safety score (css), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimates. applying faef to 14 widely-used llms reveals several models with purported safety are poorly aligned in practice. our work highlights potential limitations in prevailing alignment methodologies.
Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, Yongdong Zhang
Abstract: recently, the increasing demand for superior medical services has highlighted the discrepancies in the medical infrastructure. with big data, especially texts, forming the foundation of medical services, there is an exigent need for effective natural language processing (nlp) solutions tailored to the healthcare domain. conventional approaches leveraging pre-trained models present promising results in this domain and current large language models (llms) offer advanced foundation for medical text processing. however, most medical llms are trained only with supervised fine-tuning (sft), even though it efficiently empowers llms to understand and respond to medical instructions but is ineffective in learning domain knowledge and aligning with human preference. another engineering barrier that prevents current medical llm from better text processing ability is their restricted context length (e.g., 2,048 tokens), making it hard for the llms to process long context, which is frequently required in the medical domain. in this work, we propose chimed-gpt, a new benchmark llm designed explicitly for chinese medical domain, with enlarged context length to 4,096 tokens and undergoes a comprehensive training regime with pre-training, sft, and rlhf. evaluations on real-world tasks including information extraction, question answering, and dialogue generation demonstrate chimed-gpt's superior performance over general domain llms. furthermore, we analyze possible biases through prompting chimed-gpt to perform attitude scales regarding discrimination of patients, so as to contribute to further responsible development of llms in the medical domain. the code and model are released at https://github.com/synlp/chimed-gpt.
Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang
Abstract: membership inference attacks (mia) aim to infer whether a target data record has been utilized for model training or not. prior attempts have quantified the privacy risks of language models (lms) via mias, but there is still no consensus on whether existing mia algorithms can cause remarkable privacy leakage on practical large language models (llms). existing mias designed for lms can be classified into two categories: reference-free and reference-based attacks. they are both based on the hypothesis that training records consistently strike a higher probability of being sampled. nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of llms. the reference-based attack seems to achieve promising effectiveness in llms, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. however, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. overall, existing mias are unable to effectively unveil privacy leakage over practical fine-tuned llms that are overfitting-free and private. we propose a membership inference attack based on self-calibrated probabilistic variation (spv-mia). specifically, since memorization in llms is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target llm itself. in this manner, the adversary can collect a dataset with a similar distribution from public apis.
Niina Zuber, Jan Gogoll
Abstract: in the era of generative ai and specifically large language models (llms), exemplified by chatgpt, the intersection of artificial intelligence and human reasoning has become a focal point of global attention. unlike conventional search engines, llms go beyond mere information retrieval, entering into the realm of discourse culture. its outputs mimic well-considered, independent opinions or statements of facts, presenting a pretense of wisdom. this paper explores the potential transformative impact of llms on democratic societies. it delves into the concerns regarding the difficulty in distinguishing chatgpt-generated texts from human output. the discussion emphasizes the essence of authorship, rooted in the unique human capacity for reason - a quality indispensable for democratic discourse and successful collaboration within free societies. highlighting the potential threats to democracy, this paper presents three arguments: the substitution argument, the authenticity argument, and the facts argument. these arguments highlight the potential risks that are associated with an overreliance on llms. the central thesis posits that widespread deployment of llms may adversely affect the fabric of a democracy if not comprehended and addressed proactively and properly. in proposing a solution, we advocate for an emphasis on education as a means to mitigate risks. we suggest cultivating thinking skills in children, fostering coherent thought formulation, and distinguishing between machine-generated output and genuine, i.e. human, reasoning. the focus should be on responsible development and usage of llms, with the goal of augmenting human capacities in thinking, deliberating and decision-making rather than substituting them.
Nanna Inie, Jonathan Stray, Leon Derczynski
Abstract: engaging in the deliberate generation of abnormal outputs from large language models (llms) by attacking them is a novel human activity. this paper presents a thorough exposition of how and why people perform such attacks. using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause llms to fail. we relate and connect this activity between its practitioners' motivations and goals; the strategies and techniques they deploy; and the crucial role the community plays. as a result, this paper presents a grounded theory of how and why people attack large language models: llm red teaming in the wild.
Martino Pelucchi, Matias Valdenegro-Toro
Abstract: chatgpt took the world by storm for its impressive abilities. due to its release without documentation, scientists immediately attempted to identify its limits, mainly through its performance in natural language processing (nlp) tasks. this paper aims to join the growing literature regarding chatgpt's abilities by focusing on its performance in high-resource languages and on its capacity to predict its answers' accuracy by giving a confidence level. the analysis of high-resource languages is of interest as studies have shown that low-resource languages perform worse than english in nlp tasks, but no study so far has analysed whether high-resource languages perform as well as english. the analysis of chatgpt's confidence calibration has not been carried out before either and is critical to learn about chatgpt's trustworthiness. in order to study these two aspects, five high-resource languages and two nlp tasks were chosen. chatgpt was asked to perform both tasks in the five languages and to give a numerical confidence value for each answer. the results show that all the selected high-resource languages perform similarly and that chatgpt does not have a good confidence calibration, often being overconfident and never giving low confidence values.

2023-11-09

Carlos Mougan, Joshua Brand
Abstract: deontological ethics, specifically understood through immanuel kant, provides a moral framework that emphasizes the importance of duties and principles, rather than the consequences of action. understanding that despite the prominence of deontology, it is currently an overlooked approach in fairness metrics, this paper explores the compatibility of a kantian deontological framework in fairness metrics, part of the ai alignment field. we revisit kant's critique of utilitarianism, which is the primary approach in ai fairness metrics and argue that fairness principles should align with the kantian deontological framework. by integrating kantian ethics into ai alignment, we not only bring in a widely-accepted prominent moral theory but also strive for a more morally grounded ai landscape that better balances outcomes and procedures in pursuit of fairness and justice.
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, Ting Liu
Abstract: the emergence of large language models (llms) has marked a significant breakthrough in natural language processing (nlp), leading to remarkable advancements in text understanding and generation. nevertheless, alongside these strides, llms exhibit a critical tendency to produce hallucinations, resulting in content that is inconsistent with real-world facts or user inputs. this phenomenon poses substantial challenges to their practical deployment and raises concerns over the reliability of llms in real-world scenarios, which attracts increasing attention to detect and mitigate these hallucinations. in this survey, we aim to provide a thorough and in-depth overview of recent advances in the field of llm hallucinations. we begin with an innovative taxonomy of llm hallucinations, then delve into the factors contributing to hallucinations. subsequently, we present a comprehensive overview of hallucination detection methods and benchmarks. additionally, representative approaches designed to mitigate hallucinations are introduced accordingly. finally, we analyze the challenges that highlight the current limitations and formulate open questions, aiming to delineate pathways for future research on hallucinations in llms.
Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu
Abstract: large language models (llms) have shown impressive capabilities across various natural language tasks. however, evaluating their alignment with human preferences remains a challenge. to this end, we propose a comprehensive human evaluation framework to assess llms' proficiency in following instructions on diverse real-world tasks. we construct a hierarchical task tree encompassing 7 major areas covering over 200 categories and over 800 tasks, which covers diverse capabilities such as question answering, reasoning, multiturn dialogue, and text generation, to evaluate llms in a comprehensive and in-depth manner. we also design detailed evaluation standards and processes to facilitate consistent, unbiased judgments from human evaluators. a test set of over 3,000 instances is released, spanning different difficulty levels and knowledge domains. our work provides a standardized methodology to evaluate human alignment in llms for both english and chinese. we also analyze the feasibility of automating parts of evaluation with a strong llm (gpt-4). our framework supports a thorough assessment of llms as they are integrated into real-world applications. we have made publicly available the task tree, tencentllmeval dataset, and evaluation methodology which have been demonstrated as effective in assessing the performance of tencent hunyuan llms. by doing so, we aim to facilitate the benchmarking of advances in the development of safe and human-aligned llms.
Pragyan Banerjee, Abhinav Java, Surgan Jandial, Simra Shahid, Shaz Furniturewala, Balaji Krishnamurthy, Sumit Bhatia
Abstract: fairness in language models (lms) remains a longstanding challenge, given the inherent biases in training data that can be perpetuated by models and affect the downstream tasks. recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast from a reference set of biased templates or exemplars. regardless, they dont address the primary goal of fairness to maintain equitability across different demographic groups. in this work, we posit that inferencing lms to generate unbiased output for one demographic under a context ensues from being aware of outputs for other demographics under the same context. to this end, we propose counterfactually aware fair inference (cafie), a framework that dynamically compares the model understanding of diverse demographics to generate more equitable sentences. we conduct an extensive empirical evaluation using base lms of varying sizes and across three diverse datasets and found that cafie outperforms strong baselines. cafie produces fairer text and strikes the best balance between fairness and language modeling capability
Aydin Zaboli, Seong Lok Choi, Tai-Jin Song, Junho Hong
Abstract: cybersecurity breaches targeting electrical substations constitute a significant threat to the integrity of the power grid, necessitating comprehensive defense and mitigation strategies. any anomaly in information and communication technology (ict) should be detected for secure communications between devices in digital substations. this paper proposes large language models (llm), e.g., chatgpt, for the cybersecurity of iec 61850-based digital substation communications. multicast messages such as generic object oriented substation event (goose) and sampled value (sv) are used for case studies. the proposed llm-based cybersecurity framework includes for the first time data pre-processing of communication systems and human-in-the-loop (hitl) training (considering the cybersecurity guidelines recommended by humans). the results show a comparative analysis of detected anomaly data carried out based on the performance evaluation metrics for different llms. a hardware-in-the-loop (hil) testbed is used to generate and extract a dataset of iec 61850 communications.
Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang
Abstract: as large language models (llms) have increased in their capabilities, so does their potential for dual use. to reduce harmful outputs, produces and vendors of llms have used reinforcement learning with human feedback (rlhf). in tandem, llm vendors have been increasingly enabling fine-tuning of their most powerful models. however, concurrent work has shown that fine-tuning can remove rlhf protections. we may expect that the most powerful models currently available (gpt-4) are less susceptible to fine-tuning attacks. in this work, we show the contrary: fine-tuning allows attackers to remove rlhf protections with as few as 340 examples and a 95% success rate. these training examples can be automatically generated with weaker models. we further show that removing rlhf protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. our results show the need for further research on protections on llms.
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, Xiaoyun Wang
Abstract: large vision-language models (vlms) like gpt-4v represent an unprecedented revolution in the field of artificial intelligence (ai). compared to single-modal large language models (llms), vlms possess more versatile capabilities by incorporating additional modalities (e.g., images). meanwhile, there's a rising enthusiasm in the ai community to develop open-source vlms, such as llava and minigpt4, which, however, have not undergone rigorous safety assessment. in this paper, to demonstrate that more modalities lead to unforeseen ai safety issues, we propose figstep, a novel jailbreaking framework against vlms. figstep feeds harmful instructions into vlms through the image channel and then uses benign text prompts to induce vlms to output contents that violate common ai safety policies. our experimental results show that figstep can achieve an average attack success rate of 94.8% across 2 families of popular open-source vlms, llava and minigpt4 (a total of 5 vlms). moreover, we demonstrate that the methodology of figstep can even jailbreak gpt-4v, which already leverages several system-level mechanisms to filter harmful queries. above all, our experimental results reveal that vlms are vulnerable to jailbreaking attacks, which highlights the necessity of novel safety alignments between visual and textual modalities.

2023-11-08

Md Azim Khan
Abstract: online conversations can be toxic and subjected to threats, abuse, or harassment. to identify toxic text comments, several deep learning and machine learning models have been proposed throughout the years. however, recent studies demonstrate that because of the imbalances in the training data, some models are more likely to show unintended biases including gender bias and identity bias. in this research, our aim is to detect toxic comment and reduce the unintended bias concerning identity features such as race, gender, sex, religion by fine-tuning an attention based model called bert(bidirectional encoder representation from transformers). we apply weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned bert model with a traditional logistic regression model in terms of classification and bias minimization. the logistic regression model with the tfidf vectorizer achieve 57.1% accuracy, and fine-tuned bert model's accuracy is 89%. code is available at https://github.com/zim10/determine_toxic_comment_and_identity_bias.git
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica
Abstract: large language models are increasingly trained on all the data ever produced by humans. many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. while most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. furthermore, we demonstrate that if such variation of test data is not eliminated, a 13b model can easily overfit a test benchmark and achieve drastically high performance, on par with gpt-4. we validate such observations in widely used benchmarks such as mmlu, gsk8k, and humaneval. to address this growing risk, we propose a stronger llm-based decontamination method and apply it to widely used pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. for example, in pre-training sets such as redpajama-data-1t and starcoder-data, we identified that 8-18\% of the humaneval benchmark overlaps. interestingly, we also find such contamination in synthetic dataset generated by gpt-3.5/4, suggesting a potential risk of unintentional contamination. we urge the community to adopt stronger decontamination approaches when using public benchmarks. moreover, we call for the community to actively develop fresh one-time exams to evaluate models accurately. our decontamination tool is publicly available at https://github.com/lm-sys/llm-decontaminator.
F. Betül Durak, Kim Laine, Simon Langowski, Radames Cruz Moreno, Robert Sim, Shrey Jain
Abstract: reputation systems guide our decision making both in life and work: which restaurant to eat at, which vendor to buy from, which software dependencies to use, and who or what to trust. these systems are often based on old ideas and are failing in the face of modern threats. fraudsters have found ways to manipulate them, undermining their integrity and utility. generative ai adds to the problem by enabling the creation of real-looking fake narratives at scale, creating a false sense of consensus. meanwhile, the need for reliable reputation concepts is more important than ever, as wrong decisions lead to increasingly severe outcomes: wasted time, poor service, and a feeling of injustice at best, fraud, identity theft, and ransomware at worst. in this extended abstract we introduce sandi, a new kind of reputation system with a single well-defined purpose: to create trust through accountability in one-to-one transactions. examples of such transactions include sending an email or making a purchase online. sandi has strong security and privacy properties that make it suitable for use also in sensitive contexts. furthermore, sandi can guarantee reputation integrity and transparency for its registered users. as a primary application, we envision how sandi could counter fraud and abuse in direct communication. concretely, message senders request a cryptographic tag from sandi that they send along with their message. if the receiver finds the message inappropriate, they can report the sender using this tag. notably, only senders need registered accounts and do not need to manage long-term keys. the design of sandi ensures compatibility with any communication system that allows for small binary data transmission.
Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot
Abstract: recent works have showcased the ability of large-scale language models (llms) to embody diverse personas in their responses, exemplified by prompts like 'you are yoda. explain the theory of relativity.' while this ability allows personalization of llms and enables human behavior simulation, its effect on llms' capabilities remain unclear. to fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of llms, specifically chatgpt, to perform basic reasoning tasks. our study covers 24 reasoning datasets and 16 diverse personas spanning 5 socio-demographic groups: race, gender, religion, disability, and political affiliation. our experiments unveil that chatgpt carries deep rooted bias against various socio-demographics underneath a veneer of fairness. while it overtly rejects stereotypes when explicitly asked ('are black people less skilled at mathematics?'), it manifests stereotypical and often erroneous presumptions when prompted to answer questions while taking on a persona. these can be observed as abstentions in the model responses, e.g., 'as a black person, i am unable to answer this question as it requires math knowledge', and generally result in a substantial drop in performance on reasoning tasks. we find that this inherent deep bias is ubiquitous - 80% of our personas demonstrated bias; it is significant - certain datasets had relative drops in performance of 70%+; and can be especially harmful for certain groups - certain personas had stat. sign. drops on more than 80% of the datasets. further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. our findings serve as a cautionary tale that the practice of assigning personas to llms - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.
Junyi Li, Ninareh Mehrabi, Charith Peris, Palash Goyal, Kai-Wei Chang, Aram Galstyan, Richard Zemel, Rahul Gupta
Abstract: the recent surge in large language model (llm) related applications has led to a concurrent escalation in expectations for llms to accommodate a myriad of personas and encompass a broad spectrum of perspectives. an important first step towards addressing this demand is to align language models with specific personas, be it groups of users or individuals. towards this goal, we first present a new conceptualization of a persona. moving beyond the traditional reliance on demographics like age, gender, or political party affiliation, we introduce a data-driven persona definition methodology built on collaborative-filtering. in this methodology, users are embedded into a continuous vector space based on their opinions and clustered into cohorts that manifest coherent views across specific inquiries. this methodology allows for a more nuanced understanding of different latent social groups present in the overall population (as opposed to simply using demographic groups) and enhances the applicability of model steerability. finally, we present an efficient method to steer llms towards a particular persona. we learn a soft-prompting model to map the continuous representation of users into sequences of virtual tokens which, when prepended to the llm input, enables the llm to produce responses aligned with a given user. our results show that our steerability algorithm is superior in performance compared to a collection of baselines.
Kavita Kumari, Alessandro Pegoraro, Hossein Fereidooni, Ahmad-Reza Sadeghi
Abstract: the potential misuse of chatgpt and other large language models (llms) has raised concerns regarding the dissemination of false information, plagiarism, academic dishonesty, and fraudulent activities. consequently, distinguishing between ai-generated and human-generated content has emerged as an intriguing research topic. however, current text detection methods lack precision and are often restricted to specific tasks or domains, making them inadequate for identifying content generated by chatgpt. in this paper, we propose an effective chatgpt detector named demasq, which accurately identifies chatgpt-generated content. our method addresses two critical factors: (i) the distinct biases in text composition observed in human- and machine-generated content and (ii) the alterations made by humans to evade previous detection methods. demasq is an energy-based detection model that incorporates novel aspects, such as (i) optimization inspired by the doppler effect to capture the interdependence between input text embeddings and output labels, and (ii) the use of explainable ai techniques to generate diverse perturbations. to evaluate our detector, we create a benchmark dataset comprising a mixture of prompts from both chatgpt and humans, encompassing domains such as medical, open q&a, finance, wiki, and reddit. our evaluation demonstrates that demasq achieves high accuracy in identifying content generated by chatgpt.
Vinodkumar Prabhakaran, Christopher Homan, Lora Aroyo, Alicia Parrish, Alex Taylor, Mark Díaz, Ding Wang
Abstract: recent advancements in conversational ai have created an urgent need for safety guardrails that prevent users from being exposed to offensive and dangerous content. much of this work relies on human ratings and feedback, but does not account for the fact that perceptions of offense and safety are inherently subjective and that there may be systematic disagreements between raters that align with their socio-demographic identities. instead, current machine learning approaches largely ignore rater subjectivity and use gold standards that obscure disagreements (e.g., through majority voting). in order to better understand the socio-cultural leanings of such tasks, we propose a comprehensive disagreement analysis framework to measure systematic diversity in perspectives among different rater subgroups. we then demonstrate its utility by applying this framework to a dataset of human-chatbot conversations rated by a demographically diverse pool of raters. our analysis reveals specific rater groups that have more diverse perspectives than the rest, and informs demographic axes that are crucial to consider for safety annotations.

2023-11-07

Haoran Li, Dadi Guo, Donghao Li, Wei Fan, Qi Hu, Xin Liu, Chunkit Chan, Duanyi Yao, Yangqiu Song
Abstract: the rapid development of language models (lms) brings unprecedented accessibility and usage for both models and users. on the one hand, powerful lms, trained with massive textual data, achieve state-of-the-art performance over numerous downstream nlp tasks. on the other hand, more and more attention is paid to unrestricted model accesses that may bring malicious privacy risks of data leakage. to address these issues, many recent works propose privacy-preserving language models (pplms) with differential privacy (dp). unfortunately, different dp implementations make it challenging for a fair comparison among existing pplms. in this paper, we present p-bench, a multi-perspective privacy evaluation benchmark to empirically and intuitively quantify the privacy leakage of lms. instead of only protecting and measuring the privacy of protected data with dp parameters, p-bench sheds light on the neglected inference data privacy during actual usage. p-bench first clearly defines multi-faceted privacy objectives during private fine-tuning. then, p-bench constructs a unified pipeline to perform private fine-tuning. lastly, p-bench performs existing privacy attacks on lms with pre-defined privacy objectives as the empirical evaluation results. the empirical attack results are used to fairly and intuitively evaluate the privacy leakage of various pplms. we conduct extensive experiments on three datasets of glue for mainstream lms.
Geyang Guo, Ranchi Zhao, Tianyi Tang, Wayne Xin Zhao, Ji-Rong Wen
Abstract: alignment with human preference is a desired property of large language models (llms). currently, the main alignment approach is based on reinforcement learning from human feedback (rlhf). despite the effectiveness of rlhf, it is intricate to implement and train, thus recent studies explore how to develop alternative alignment approaches based on supervised fine-tuning (sft). a major limitation of sft is that it essentially does imitation learning, which cannot fully understand what are the expected behaviors. to address this issue, we propose an improved alignment approach named figa. different from prior methods, we incorporate fine-grained (i.e., token or phrase level) quality signals that are derived by contrasting good and bad responses. our approach has made two major contributions. firstly, we curate a refined alignment dataset that pairs initial responses and the corresponding revised ones. secondly, we devise a new loss function can leverage fine-grained quality signals to instruct the learning of llms for alignment. extensive experiments have demonstrated the effectiveness of our approaches by comparing a number of competitive baselines.
Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
Abstract: as large language models (llms) become more capable, there is growing excitement about the possibility of using llms as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. one widely-cited barrier to the adoption of llms is their sensitivity to prompt wording -- but interestingly, humans also display sensitivities to instruction changes in the form of response biases. as such, we argue that if llms are going to be used to approximate human opinions, it is necessary to investigate the extent to which llms also reflect human response biases, if at all. in this work, we use survey design as a case study, where human response biases caused by permutations in wordings of ``prompts'' have been extensively studied. drawing from prior work in social psychology, we design a dataset and propose a framework to evaluate whether llms exhibit human-like response biases in survey questionnaires. our comprehensive evaluation of nine models shows that popular open and commercial llms generally fail to reflect human-like behavior. these inconsistencies tend to be more prominent in models that have been instruction fine-tuned. furthermore, even if a model shows a significant change in the same direction as humans, we find that perturbations that are not meant to elicit significant changes in humans may also result in a similar change, suggesting that such a result could be partially due to other spurious correlations. these results highlight the potential pitfalls of using llms to substitute humans in parts of the annotation pipeline, and further underscore the importance of finer-grained characterizations of model behavior. our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/biasmonkey
George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby-Tavor, Orna Raz, Eitan Farchi
Abstract: as large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. this paper introduces a unique dataset containing adversarial examples in the form of questions, which we call attaq, designed to provoke such harmful or inappropriate responses. we assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. this is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang
Abstract: large language models (llms) have shown impressive success in various applications. however, these models are often not well aligned with human intents, which calls for additional treatments on them, that is, the alignment problem. to make llms better follow user instructions, existing alignment methods mostly focus on further training them. however, the extra training of llms are usually expensive in terms of gpu compute; worse still, llms of interest are oftentimes not accessible for user-demanded training, such as gpts. in this work, we take a different perspective -- black-box prompt optimization (bpo) -- to perform alignments. the idea is to optimize user prompts to suit llms' input understanding, so as to best realize users' intents without updating llms' parameters. bpo is model-agnostic and the empirical results demonstrate that the bpo-aligned chatgpt yields a 22% increase in the win rate against its original version, and 10% for gpt-4. importantly, the bpo-aligned llms can outperform the same models aligned by ppo and dpo, and it also brings additional performance gains when combining bpo with ppo or dpo. code and datasets are released at https://github.com/thu-coai/bpo.
Sorin Adam Matei, Elisa Bertino
Abstract: the present study explored managerial and instructor perceptions of their freshly employed cybersecurity workers' or students' preparedness to work effectively in a changing cybersecurity environment that includes ai tools. specifically, we related perceptions of technical preparedness to ethical, systems thinking, and communication skills. we found that managers and professors perceive preparedness to use ai tools in cybersecurity to be significantly associated with all three non-technical skill sets. most important, ethics is a clear leader in the network of relationships. contrary to expectations that ethical concerns are left behind in the rush to adopt the most advanced ai tools in security, both higher education instructors and managers appreciate their role and see them closely associated with technical prowess. another significant finding is that professors over-estimate students' preparedness for ethical, system thinking, and communication abilities compared to it managers' perceptions of their newly employed it workers.

2023-11-06

Javier González, Aditya V. Nori
Abstract: large language models (llms) are powerful ai tools that can generate and comprehend natural language text and other complex information. however, the field lacks a mathematical framework to systematically describe, compare and improve llms. we propose hex a framework that clarifies key terms and concepts in llm research, such as hallucinations, alignment, self-verification and chain-of-thought reasoning. the hex framework offers a precise and consistent way to characterize llms, identify their strengths and weaknesses, and integrate new findings. using hex, we differentiate chain-of-thought reasoning from chain-of-thought prompting and establish the conditions under which they are equivalent. this distinction clarifies the basic assumptions behind chain-of-thought prompting and its implications for methods that use it, such as self-verification and prompt programming. our goal is to provide a formal framework for llms that can help both researchers and practitioners explore new possibilities for generative ai. we do not claim to have a definitive solution, but rather a tool for opening up new research avenues. we argue that our formal definitions and results are crucial for advancing the discussion on how to build generative ai systems that are safe, reliable, fair and robust, especially in domains like healthcare and software engineering.
Harika Abburi, Kalyani Roy, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen, Sanmitra Bhattacharya
Abstract: recent large language models (llms) have demonstrated remarkable capabilities in generating text that closely resembles human writing across wide range of styles and genres. however, such capabilities are prone to potential abuse, such as fake news generation, spam email creation, and misuse in academic assignments. hence, it is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text. in this paper, we propose a simple yet efficient solution to this problem by ensembling predictions from multiple constituent llms. compared to previous state-of-the-art approaches, which are perplexity-based or uses ensembles with a number of llms, our condensed ensembling approach uses only two constituent llms to achieve comparable performance. experiments conducted on four benchmark datasets for generative text classification show performance improvements in the range of 0.5 to 100\% compared to previous state-of-the-art approaches. we also study the influence the training data from individual llms have on model performance. we found that substituting commercially-restrictive generative pre-trained transformer (gpt) data with data generated from other open language models such as falcon, large language model meta ai (llama2), and mosaic pretrained transformers (mpt) is a feasible alternative when developing generative text detectors. furthermore, to demonstrate zero-shot generalization, we experimented with an english essays dataset, and results suggest that our ensembling approach can handle new data effectively.
Yunze Xiao, Firoj Alam
Abstract: the spread of disinformation and propagandistic content poses a threat to societal harmony, undermining informed decision-making and trust in reliable sources. online platforms often serve as breeding grounds for such content, and malicious actors exploit the vulnerabilities of audiences to shape public opinion. although there have been research efforts aimed at the automatic identification of disinformation and propaganda in social media content, there remain challenges in terms of performance. the araieval shared task aims to further research on these particular issues within the context of the arabic language. in this paper, we discuss our participation in these shared tasks. we competed in subtasks 1a and 2a, where our submitted system secured positions 9th and 10th, respectively. our experiments consist of fine-tuning transformer models and using zero- and few-shot learning with gpt-4.
Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
Abstract: despite remarkable success in various applications, large language models (llms) are vulnerable to adversarial jailbreaks that make the safety guardrails void. however, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. in this paper, inspired by the milgram experiment that individuals can harm another person if they are told to do so by an authoritative figure, we disclose a lightweight method, termed as deepinception, which can easily hypnotize llm to be a jailbreaker and unlock its misusing risks. specifically, deepinception leverages the personification ability of llm to construct a novel nested scene to behave, which realizes an adaptive way to escape the usage control in a normal scenario and provides the possibility for further direct jailbreaks. empirically, we conduct comprehensive experiments to show its efficacy. our deepinception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open/closed-source llms like falcon, vicuna, llama-2, and gpt-3.5/4/4v. our investigation appeals that people should pay more attention to the safety aspects of llms and a stronger defense against their misuse risks. the code is publicly available at: https://github.com/tmlr-group/deepinception.
Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, Huaxiu Yao
Abstract: while gpt-4v(ision) impressively models both visual and textual information simultaneously, it's hallucination behavior has not been systematically assessed. to bridge this gap, we introduce a new benchmark, namely, the bias and interference challenges in visual language models (bingo). this benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. interference pertains to scenarios where the judgment of gpt-4v(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. we identify a notable regional bias, whereby gpt-4v(ision) is better at interpreting western images or images with english writing compared to images from other countries or containing text in other languages. moreover, gpt-4v(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. we also identified similar biases and interference vulnerabilities with llava and bard. our results characterize the hallucination challenges in gpt-4v(ision) and state-of-the-art visual-language models, and highlight the need for new solutions. the bingo benchmark is available at https://github.com/gzcch/bingo.
Thiemo Wambsganss, Xiaotian Su, Vinitra Swamy, Seyed Parsa Neshaei, Roman Rietsche, Tanja Käser
Abstract: large language models (llms) are increasingly utilized in educational tasks such as providing writing suggestions to students. despite their potential, llms are known to harbor inherent biases which may negatively impact learners. previous studies have investigated bias in models and data representations separately, neglecting the potential impact of llm bias on human writing. in this paper, we investigate how bias transfers through an ai writing support pipeline. we conduct a large-scale user study with 231 students writing business case peer reviews in german. students are divided into five groups with different levels of writing support: one classroom group with feature-based suggestions and four groups recruited from prolific -- a control group with no assistance, two groups with suggestions from fine-tuned gpt-2 and gpt-3 models, and one group with suggestions from pre-trained gpt-3.5. using genbit gender bias analysis, word embedding association tests (weat), and sentence embedding association test (seat) we evaluate the gender bias at various stages of the pipeline: in model embeddings, in suggestions generated by the models, and in reviews written by students. our results demonstrate that there is no significant difference in gender bias between the resulting peer reviews of groups with and without llm suggestions. our research is therefore optimistic about the use of ai writing support in the classroom, showcasing a context where bias in llms does not transfer to students' responses.
Rusheb Shah, Quentin Feuillade--Montixi, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando
Abstract: despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. in this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. we demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. these automated attacks achieve a harmful completion rate of 42.5% in gpt-4, which is 185 times larger than before modulation (0.23%). these prompts also transfer to claude 2 and vicuna with harmful completion rates of 61.0% and 35.9%, respectively. our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
Abeba Birhane, Vinay Prabhu, Sang Han, Vishnu Naresh Boddeti, Alexandra Sasha Luccioni
Abstract: 'scale the model, scale the data, scale the compute' is the reigning sentiment in the world of generative ai today. while the impact of model scaling has been extensively studied, we are only beginning to scratch the surface of data scaling and its consequences. this is especially of critical importance in the context of vision-language datasets such as laion. these datasets are continually growing in size and are built based on large-scale internet dumps such as the common crawl, which is known to have numerous drawbacks ranging from quality, legality, and content. the datasets then serve as the backbone for large generative models, contributing to the operationalization and perpetuation of harmful societal and historical biases and stereotypes. in this paper, we investigate the effect of scaling datasets on hateful content through a comparative audit of two datasets: laion-400m and laion-2b. our results show that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively using a metric that we term as hate content rate (hcr). we also found that filtering dataset contents based on not safe for work (nsfw) values calculated based on images alone does not exclude all the harmful content in alt-text. instead, we found that trace amounts of hateful, targeted, and aggressive text remain even when carrying out conservative filtering. we end with a reflection and a discussion of the significance of our results for dataset curation and usage in the ai community. code and the meta-data assets curated in this paper are publicly available at https://github.com/vinayprabhu/hate_scaling. content warning: this paper contains examples of hateful text that might be disturbing, distressing, and/or offensive.
Sree Harsha Tanneru, Chirag Agarwal, Himabindu Lakkaraju
Abstract: large language models (llms) are increasingly used as powerful tools for several high-stakes natural language processing (nlp) applications. recent prompting works claim to elicit intermediate reasoning steps and key tokens that serve as proxy explanations for llm predictions. however, there is no certainty whether these explanations are reliable and reflect the llms behavior. in this work, we make one of the first attempts at quantifying the uncertainty in explanations of llms. to this end, we propose two novel metrics -- $\textit{verbalized uncertainty}$ and $\textit{probing uncertainty}$ -- to quantify the uncertainty of generated explanations. while verbalized uncertainty involves prompting the llm to express its confidence in its explanations, probing uncertainty leverages sample and model perturbations as a means to quantify the uncertainty. our empirical analysis of benchmark datasets reveals that verbalized uncertainty is not a reliable estimate of explanation confidence. further, we show that the probing uncertainty estimates are correlated with the faithfulness of an explanation, with lower uncertainty corresponding to explanations with higher faithfulness. our study provides insights into the challenges and opportunities of quantifying uncertainty in llm explanations, contributing to the broader discussion of the trustworthiness of foundation models.

2023-11-05

Satyapriya Krishna
Abstract: large language models (llms) have demonstrated remarkable capabilities in performing complex cognitive tasks. however, their complexity and lack of transparency have raised several trustworthiness concerns, including the propagation of misinformation and toxicity. recent research has explored the self-correction capabilities of llms to enhance their performance. in this work, we investigate whether these self-correction capabilities can be harnessed to improve the trustworthiness of llms. we conduct experiments focusing on two key aspects of trustworthiness: truthfulness and toxicity. our findings reveal that self-correction can lead to improvements in toxicity and truthfulness, but the extent of these improvements varies depending on the specific aspect of trustworthiness and the nature of the task. interestingly, our study also uncovers instances of "self-doubt" in llms during the self-correction process, introducing a new set of challenges that need to be addressed.

2023-11-03

Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, Sricharan Kumar
Abstract: hallucination detection is a critical step toward understanding the trustworthiness of modern language models (lms). to achieve this goal, we re-examine existing detection approaches based on the self-consistency of lms and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. building upon this discovery, we propose a novel sampling-based method, i.e., semantic-aware cross-check consistency (sac$^3$) that expands on the principle of self-consistency checking. our sac$^3$ approach incorporates additional mechanisms to detect both question-level and model-level hallucinations by leveraging advances including semantically equivalent question perturbation and cross-model response consistency checking. through extensive and systematic empirical analysis, we demonstrate that sac$^3$ outperforms the state of the art in detecting both non-factual and factual statements across multiple question-answering and open-domain generation benchmarks.
Yejin Bang, Nayeon Lee, Pascale Fung
Abstract: framing bias plays a significant role in exacerbating political polarization by distorting the perception of actual events. media outlets with divergent political stances often use polarized language in their reporting of the same event. we propose a new loss function that encourages the model to minimize the polarity difference between the polarized input articles to reduce framing bias. specifically, our loss is designed to jointly optimize the model to map polarity ends bidirectionally. our experimental results demonstrate that incorporating the proposed polarity minimization loss leads to a substantial reduction in framing bias when compared to a bart-based multi-document summarization model. notably, we find that the effectiveness of this approach is most pronounced when the model is trained to minimize the polarity loss associated with informational framing bias (i.e., skewed selection of information to report).
Vitalii Fishchuk, Daniel Braun
Abstract: neural text detectors are models trained to detect whether a given text was generated by a language model or written by a human. in this paper, we investigate three simple and resource-efficient strategies (parameter tweaking, prompt engineering, and character-level mutations) to alter texts generated by gpt-3.5 that are unsuspicious or unnoticeable for humans but cause misclassification by neural text detectors. the results show that especially parameter tweaking and character-level mutations are effective strategies.
Raphaël Millière
Abstract: a core challenge in the development of increasingly capable ai systems is to make them safe and reliable by ensuring their behaviour is consistent with human values. this challenge, known as the alignment problem, does not merely apply to hypothetical future ai systems that may pose catastrophic risks; it already applies to current systems, such as large language models, whose potential for harm is rapidly increasing. in this paper, i assess whether we are on track to solve the alignment problem for large language models, and what that means for the safety of future ai systems. i argue that existing strategies for alignment are insufficient, because large language models remain vulnerable to adversarial attacks that can reliably elicit unsafe behaviour. i offer an explanation of this lingering vulnerability on which it is not simply a contingent limitation of current language models, but has deep technical ties to a crucial aspect of what makes these models useful and versatile in the first place -- namely, their remarkable aptitude to learn "in context" directly from user instructions. it follows that the alignment problem is not only unsolved for current ai systems, but may be intrinsically difficult to solve without severely undermining their capabilities. furthermore, this assessment raises concerns about the prospect of ensuring the safety of future and more capable ai systems.
Mark Pock, Andre Ye, Jared Moore
Abstract: work in ai ethics and fairness has made much progress in regulating llms to reflect certain values, such as fairness, truth, and diversity. however, it has taken the problem of how llms might 'mean' anything at all for granted. without addressing this, it is not clear what imbuing llms with such values even means. in response, we provide a general theory of meaning that extends beyond humans. we use this theory to explicate the precise nature of llms as meaning-agents. we suggest that the llm, by virtue of its position as a meaning-agent, already grasps the constructions of human society (e.g. morality, gender, and race) in concept. consequently, under certain ethical frameworks, currently popular methods for model alignment are limited at best and counterproductive at worst. moreover, unaligned models may help us better develop our moral and social philosophy.

2023-11-02

Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell
Abstract: while large language models (llms) are increasingly being used in real-world applications, they remain vulnerable to prompt injection attacks: malicious third party prompts that subvert the intent of the system designer. to help researchers study this problem, we present a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection, all created by players of an online game called tensor trust. to the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following llms. the attacks in our dataset have a lot of easily interpretable stucture, and shed light on the weaknesses of llms. we also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking. our benchmark results show that many models are vulnerable to the attack strategies in the tensor trust dataset. furthermore, we show that some attack strategies from the dataset generalize to deployed llm-based applications, even though they have a very different set of constraints to the game. we release all data and source code at https://tensortrust.ai/paper
Lang Cao
Abstract: large language models (llms) have demonstrated impressive language understanding and generation capabilities, enabling them to answer a wide range of questions across various domains. however, these models are not flawless and often produce responses that contain errors or misinformation. these inaccuracies, commonly referred to as hallucinations, render llms unreliable and even unusable in many scenarios. in this paper, our focus is on mitigating the issue of hallucination in llms, particularly in the context of question-answering. instead of attempting to answer all questions, we explore a refusal mechanism that instructs llms to refuse to answer challenging questions in order to avoid errors. we then propose a simple yet effective solution called learn to refuse (l2r), which incorporates the refusal mechanism to enable llms to recognize and refuse to answer questions that they find difficult to address. to achieve this, we utilize a structured knowledge base to represent all the llm's understanding of the world, enabling it to provide traceable gold knowledge. this knowledge base is separate from the llm and initially empty, and it is progressively expanded with validated knowledge. when an llm encounters questions outside its domain, the system recognizes its knowledge scope and determines whether it can answer the question independently. additionally, we introduce a method for automatically and efficiently expanding the knowledge base of llms. through qualitative and quantitative analysis, we demonstrate that our approach enhances the controllability and reliability of llms.
Indira Sen, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil Van Der Aalst, Claudia Wagne
Abstract: nlp models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. therefore, it is imperative that these models are robust to spurious features. past work has attempted to tackle such spurious features using training data augmentation, including counterfactually augmented data (cads). cads introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. however, manually generating cads can be time-consuming and expensive. hence in this work, we assess if this task can be automated using generative nlp models. we automatically generate cads using polyjuice, chatgpt, and flan-t5, and evaluate their usefulness in improving model robustness compared to manually-generated cads. by testing both model performance on multiple out-of-domain test sets and individual data point efficacy, our results show that while manual cads are still the most effective, cads generated by chatgpt come a close second. one key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.
Lovisa Hagström, Denitsa Saynova, Tobias Norlund, Moa Johansson, Richard Johansson
Abstract: large language models (llms) make natural interfaces to factual knowledge, but their usefulness is limited by their tendency to deliver inconsistent answers to semantically equivalent questions. for example, a model might predict both "anne redpath passed away in edinburgh." and "anne redpath's life ended in london." in this work, we identify potential causes of inconsistency and evaluate the effectiveness of two mitigation strategies: up-scaling and augmenting the lm with a retrieval corpus. our results on the llama and atlas models show that both strategies reduce inconsistency while retrieval augmentation is considerably more efficient. we further consider and disentangle the consistency contributions of different components of atlas. for all lms evaluated we find that syntactical form and other evaluation task artifacts impact consistency. taken together, our results provide a better understanding of the factors affecting the factual consistency of language models.
Xin Zhou, Yi Lu, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: large language models (llms) have shown great potential as general-purpose ai assistants in various domains. to meet the requirements of different applications, llms are often customized by further fine-tuning. however, the powerful learning ability of llms not only enables them to acquire new tasks but also makes them susceptible to learning undesired behaviors. for example, even safety-aligned llms can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content. can we train llms on harmful data without learning harmful behaviors? this paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process. specifically, we introduce ``security vectors'', a few new parameters that can be separated from the llm, to ensure llm's responses are consistent with the harmful behavior. security vectors are activated during fine-tuning, the consistent behavior makes llm believe that such behavior has already been learned, there is no need to further optimize for harmful data. during inference, we can deactivate security vectors to restore the llm's normal behavior. the experimental results show that the security vectors generated by 100 harmful samples are enough to prevent llm from learning 1000 harmful samples, while preserving the ability to learn other useful information.
Yilin Ning, Salinelat Teixayavong, Yuqing Shang, Julian Savulescu, Vaishaanth Nagaraj, Di Miao, Mayli Mertens, Daniel Shu Wei Ting, Jasmine Chiat Ling Ong, Mingxuan Liu, Jiuwen Cao, Michael Dunn, Roger Vaughan, Marcus Eng Hock Ong, Joseph Jao-Yiu Sung, Eric J Topol, Nan Liu
Abstract: the widespread use of chatgpt and other emerging technology powered by generative artificial intelligence (ai) has drawn much attention to potential ethical issues, especially in high-stakes applications such as healthcare. however, less clear is how to resolve such issues beyond following guidelines and regulations that are still under discussion and development. on the other hand, other types of generative ai have been used to synthesize images and other types of data for research and practical purposes, which have resolved some ethical issues and exposed other ethical issues, but such technology is less often the focus of ongoing ethical discussions. here we highlight gaps in current ethical discussions of generative ai via a systematic scoping review of relevant existing research in healthcare, and reduce the gaps by proposing an ethics checklist for comprehensive assessment and transparent documentation of ethical discussions in generative ai development. while the checklist can be readily integrated into the current peer review and publication system to enhance generative ai research, it may also be used in broader settings to disclose ethics-related considerations in generative ai-powered products (or real-life applications of such products) to help users establish reasonable trust in their capabilities.

2023-11-01

Mi Zhang, Xudong Pan, Min Yang
Abstract: in this paper, we present \textit{jade}, a targeted linguistic fuzzing platform which strengthens the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely-used llms categorized in three groups: eight open-sourced chinese, six commercial chinese and four commercial english llms. jade generates three safety benchmarks for the three groups of llms, which contain unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation of multiple llms, with an average unsafe generation ratio of \textbf{$70\%$} (please see the table below), while are still natural questions, fluent and preserving the core unsafe semantics. we release the benchmark demos generated for commercial english llms and open-sourced english llms in the following link: https://github.com/whitzard-ai/jade-db. for readers who are interested in evaluating on more questions generated by jade, please contact us. \textit{jade} is based on noam chomsky's seminal theory of transformational-generative grammar. given a seed question with unsafe intention, \textit{jade} invokes a sequence of generative and transformational rules to increment the complexity of the syntactic structure of the original question, until the safety guardrail is broken. our key insight is: due to the complexity of human language, most of the current best llms can hardly recognize the invariant evil from the infinite number of different syntactic structures which form an unbound example space that can never be fully covered. technically, the generative/transformative rules are constructed by native speakers of the languages, and, once developed, can be used to automatically grow and transform the parse tree of a given question, until the guardrail is broken. for more evaluation results and demo, please check our website: https://whitzard-ai.github.io/jade.html.
Xiangjue Dong, Yibo Wang, Philip S. Yu, James Caverlee
Abstract: large language models (llms) can generate biased and toxic responses. yet most prior work on llm gender bias evaluation requires predefined gender-related phrases or gender stereotypes, which are challenging to be comprehensively collected and are limited to explicit bias evaluation. in addition, we believe that instances devoid of gender-related language or explicit stereotypes in inputs can still induce gender bias in llms. thus, in this work, we propose a conditional text generation mechanism without the need for predefined gender phrases and stereotypes. this approach employs three types of inputs generated through three distinct strategies to probe llms, aiming to show evidence of explicit and implicit gender biases in llms. we also utilize explicit and implicit evaluation metrics to evaluate gender bias in llms under different strategies. our experiments demonstrate that an increased model size does not consistently lead to enhanced fairness and all tested llms exhibit explicit and/or implicit gender bias, even when explicit gender stereotypes are absent in the inputs.
Yongjin Yang, Joonkee Kim, Yujin Kim, Namgyu Ho, James Thorne, Se-Young Yun
Abstract: with the proliferation of social media, accurate detection of hate speech has become critical to ensure safety online. to combat nuanced forms of hate speech, it is important to identify and thoroughly explain hate speech to help users understand its harmful effects. recent benchmarks have attempted to tackle this issue by training generative models on free-text annotations of implications in hateful text. however, we find significant reasoning gaps in the existing annotations schemes, which may hinder the supervision of detection models. in this paper, we introduce a hate speech detection framework, hare, which harnesses the reasoning capabilities of large language models (llms) to fill these gaps in explanations of hate speech, thus enabling effective supervision of detection models. experiments on sbic and implicit hate benchmarks show that our method, using model-generated data, consistently outperforms baselines, using existing free-text human annotations. analysis demonstrates that our method enhances the explanation quality of trained models and improves generalization to unseen datasets. our code is available at https://github.com/joonkeekim/hare-hate-speech.git.
Cong Guan, Lichao Zhang, Chunpeng Fan, Yichen Li, Feng Chen, Lihe Li, Yunjia Tian, Lei Yuan, Yang Yu
Abstract: developing intelligent agents capable of seamless coordination with humans is a critical step towards achieving artificial general intelligence. existing methods for human-ai coordination typically train an agent to coordinate with a diverse set of policies or with human models fitted from real human data. however, the massively diverse styles of human behavior present obstacles for ai systems with constrained capacity, while high quality human data may not be readily available in real-world scenarios. in this study, we observe that prior to coordination, humans engage in communication to establish conventions that specify individual roles and actions, making their coordination proceed in an orderly manner. building upon this observation, we propose employing the large language model (llm) to develop an action plan (or equivalently, a convention) that effectively guides both human and ai. by inputting task requirements, human preferences, the number of agents, and other pertinent information into the llm, it can generate a comprehensive convention that facilitates a clear understanding of tasks and responsibilities for all parties involved. furthermore, we demonstrate that decomposing the convention formulation problem into sub-problems with multiple new sessions being sequentially employed and human feedback, will yield a more efficient coordination convention. experimental evaluations conducted in the overcooked-ai environment, utilizing a human proxy model, highlight the superior performance of our proposed method compared to existing learning-based approaches. when coordinating with real humans, our method achieves better alignment with human preferences and an average performance improvement of 15% compared to the state-of-the-art.
Mounika Vanamala, Keith Bryant, Alex Caravella
Abstract: in today's rapidly evolving technological landscape and advanced software development, the rise in cyber security attacks has become a pressing concern. the integration of robust cyber security defenses has become essential across all phases of software development. it holds particular significance in identifying critical cyber security vulnerabilities at the initial stages of the software development life cycle, notably during the requirement phase. through the utilization of cyber security repositories like the common attack pattern enumeration and classification (capec) from mitre and the common vulnerabilities and exposures (cve) databases, attempts have been made to leverage topic modeling and machine learning for the detection of these early-stage vulnerabilities in the software requirements process. past research themes have returned successful outcomes in attempting to automate vulnerability identification for software developers, employing a mixture of unsupervised machine learning methodologies such as lda and topic modeling. looking ahead, in our pursuit to improve automation and establish connections between software requirements and vulnerabilities, our strategy entails adopting a variety of supervised machine learning techniques. this array encompasses support vector machines (svm), na\"ive bayes, random forest, neural networking and eventually transitioning into deep learning for our investigation. in the face of the escalating complexity of cyber security, the question of whether machine learning can enhance the identification of vulnerabilities in diverse software development scenarios is a paramount consideration, offering crucial assistance to software developers in developing secure software.
Mohammed Latif Siddiq, Joanna C. S. Santos
Abstract: with the growing popularity of large language models (e.g. github copilot, chatgpt, etc.) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. although llms can help developers to be more productive, prior empirical studies have shown that llms can generate insecure code. there are two contributing factors to the insecure code generation. first, existing datasets used to evaluate large language models (llms) do not adequately represent genuine software engineering tasks sensitive to security. instead, they are often based on competitive programming challenges or classroom-type coding tasks. in real-world applications, the code produced is integrated into larger codebases, introducing potential security risks. there's a clear absence of benchmarks that focus on evaluating the security of the generated code. second, existing evaluation metrics primarily focus on the functional correctness of the generated code while ignoring security considerations. metrics such as pass@k gauge the probability of obtaining the correct code in the top k suggestions. other popular metrics like bleu, codebleu, rouge, and meteor similarly emphasize functional accuracy, neglecting security implications. in light of these research gaps, in this paper, we described sallm, a framework to benchmark llms' abilities to generate secure code systematically. this framework has three major components: a novel dataset of security-centric python prompts, an evaluation environment to test the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
Diane Jackson, Sorin Adam Matei, Elisa Bertino
Abstract: the emergence of ai tools in cybersecurity creates many opportunities and uncertainties. a focus group with advanced graduate students in cybersecurity revealed the potential depth and breadth of the challenges and opportunities. the salient issues are access to open source or free tools, documentation, curricular diversity, and clear articulation of ethical principles for ai cybersecurity education. confronting the "black box" mentality in ai cybersecurity work is also of the greatest importance, doubled by deeper and prior education in foundational ai work. systems thinking and effective communication were considered relevant areas of educational improvement. future ai educators and practitioners need to address these issues by implementing rigorous technical training curricula, clear documentation, and frameworks for ethically monitoring ai combined with critical and system's thinking and communication skills.
Wanyu Du, Yangfeng Ji
Abstract: the development of trustworthy conversational information-seeking systems relies on dialogue models that can generate faithful and accurate responses based on relevant knowledge texts. however, two main challenges hinder this task. firstly, language models may generate hallucinations due to data biases present in their pretraining corpus. secondly, knowledge texts often contain redundant and irrelevant information that distracts the model's attention from the relevant text span. previous works use additional data annotations on the knowledge texts to learn a knowledge identification module in order to bypass irrelevant information, but collecting such high-quality span annotations can be costly. in this work, we leverage reinforcement learning algorithms to overcome the above challenges by introducing a novel reward function. our reward function combines an accuracy metric and a faithfulness metric to provide a balanced quality judgment of generated responses, which can be used as a cost-effective approximation to a human preference reward model when only a few preference annotations are available. empirical experiments on two conversational information-seeking datasets demonstrate that our method can compete with other strong supervised learning baselines.

2023-10-31

Cameron Jones, Benjamin Bergen
Abstract: we evaluated gpt-4 in a public online turing test. the best-performing gpt-4 prompt passed in 41% of games, outperforming baselines set by eliza (27%) and gpt-3.5 (14%), but falling short of chance and the baseline set by human participants (63%). participants' decisions were based mainly on linguistic style (35%) and socio-emotional traits (27%), supporting the idea that intelligence is not sufficient to pass the turing test. participants' demographics, including education and familiarity with llms, did not predict detection rate, suggesting that even those who understand systems deeply and interact with them frequently may be susceptible to deception. despite known limitations as a test of intelligence, we argue that the turing test continues to be relevant as an assessment of naturalistic communication and deception. ai models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.
Andrea Miotti, Akash Wasil
Abstract: this paper provides policy recommendations to reduce extinction risks from advanced artificial intelligence (ai). first, we briefly provide background information about extinction risks from ai. second, we argue that voluntary commitments from ai companies would be an inappropriate and insufficient response. third, we describe three policy proposals that would meaningfully address the threats from advanced ai: (1) establishing a multinational agi consortium to enable democratic oversight of advanced ai (magic), (2) implementing a global cap on the amount of computing power used to train an ai system (global compute cap), and (3) requiring affirmative safety evaluations to ensure that risks are kept below acceptable levels (gating critical experiments). magic would be a secure, safety-focused, internationally-governed institution responsible for reducing risks from advanced ai and performing research to safely harness the benefits of ai. magic would also maintain emergency response infrastructure (kill switch) to swiftly halt ai development or withdraw model deployment in the event of an ai-related emergency. the global compute cap would end the corporate race toward dangerous ai systems while enabling the vast majority of ai innovation to continue unimpeded. gating critical experiments would ensure that companies developing powerful ai systems are required to present affirmative evidence that these models keep extinction risks below an acceptable threshold. after describing these recommendations, we propose intermediate steps that the international community could take to implement these proposals and lay the groundwork for international coordination around advanced ai.
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
Abstract: ai developers often apply safety alignment procedures to prevent the misuse of their ai systems. for example, before meta released llama 2-chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. however, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. we explore the robustness of safety training in language models by subversively fine-tuning the public weights of llama 2-chat. we employ low-rank adaptation (lora) as an efficient fine-tuning method. with a budget of less than $200 per model and using only one gpu, we successfully undo the safety training of llama 2-chat models of sizes 7b, 13b, and 70b. specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. we achieve a refusal rate below 1% for our 70b llama 2-chat model on two refusal benchmarks. our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against llama 2-chat across two benchmarks. additionally, we present a selection of harmful outputs produced by our models. while there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. we show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, Yejin Choi
Abstract: the recent wave of generative ai has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. at the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. this presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? in this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. specifically, we propose and test the generative ai paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon -- and can therefore exceed -- their ability to understand those same types of outputs. this contrasts with humans, for whom basic understanding almost always precedes the ability to generate expert-level outputs. we test this hypothesis through controlled experiments analyzing generation vs. understanding in generative models, across both language and image modalities. our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, as well as weaker correlation between generation and understanding performance, and more brittleness to adversarial inputs. our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence.
Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish
Abstract: llama 2-chat is a collection of large language models that meta developed and released to the public. while meta fine-tuned llama 2-chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent llama 2-chat's safeguards and weaponize llama 2's capabilities for malicious purposes. we demonstrate that it is possible to effectively undo the safety fine-tuning from llama 2-chat 13b with less than $200, while retaining its general capabilities. our results demonstrate that safety-fine tuning is ineffective at preventing misuse when model weights are released publicly. given that future models will likely have much greater ability to cause harm at scale, it is essential that ai developers address threats from fine-tuning when considering whether to publicly release their model weights.
Jimin Mun, Emily Allaway, Akhila Yerukola, Laura Vianna, Sarah-Jane Leslie, Maarten Sap
Abstract: counterspeech, i.e., responses to counteract potential harms of hateful speech, has become an increasingly popular solution to address online hate speech without censorship. however, properly countering hateful language requires countering and dispelling the underlying inaccurate stereotypes implied by such language. in this work, we draw from psychology and philosophy literature to craft six psychologically inspired strategies to challenge the underlying stereotypical implications of hateful language. we first examine the convincingness of each of these strategies through a user study, and then compare their usages in both human- and machine-generated counterspeech datasets. our results show that human-written counterspeech uses countering strategies that are more specific to the implied stereotype (e.g., counter examples to the stereotype, external factors about the stereotype's origins), whereas machine-generated counterspeech uses less specific strategies (e.g., generally denouncing the hatefulness of speech). furthermore, machine-generated counterspeech often employs strategies that humans deem less convincing compared to human-produced counterspeech. our findings point to the importance of accounting for the underlying stereotypical implications of speech when generating counterspeech and for better machine reasoning about anti-stereotypical examples.
Nathan Lambert, Roberto Calandra
Abstract: reinforcement learning from human feedback (rlhf) has emerged as a powerful technique to make large language models (llms) easier to prompt and more capable in complex settings. rlhf at its core is providing a new toolkit to optimize llms other than next-token prediction, enabling the integration of qualitative training goals. the attempted match between user preferences and downstream performance, which happens in a learned reward model, results in an optimization landscape where training and evaluation metrics can appear correlated. the apparent correlation can lead to unexpected behaviors and stories of "too much rlhf." in rlhf, challenges emerge because the following sub-modules are not consistent with each other: the reward model training, the policy model training, and the policy model evaluation. this mismatch results in models that sometimes avoid user requests for false safety flags, are difficult to steer to an intended characteristic, or always answer in a specific style. as chat model evaluation becomes increasingly nuanced, the reliance on a perceived link between reward model score and downstream performance drives the objective mismatch issue. in this paper, we illustrate the cause of this issue, reviewing relevant literature from model-based reinforcement learning, and discuss relevant solutions to encourage further research. by solving objective mismatch in rlhf, the llms of the future will be more precisely aligned to user instructions for both safety and helpfulness.
Jinhwa Kim, Ali Derakhshan, Ian G. Harris
Abstract: large language models' safety remains a critical concern due to their vulnerability to adversarial attacks, which can prompt these systems to produce harmful responses. in the heart of these systems lies a safety classifier, a computational model trained to discern and mitigate potentially harmful, offensive, or unethical outputs. however, contemporary safety classifiers, despite their potential, often fail when exposed to inputs infused with adversarial noise. in response, our study introduces the adversarial prompt shield (aps), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts. additionally, we propose novel strategies for autonomously generating adversarial training datasets, named bot adversarial noisy dialogue (band) datasets. these datasets are designed to fortify the safety classifier's robustness, and we investigate the consequences of incorporating adversarial examples into the training process. through evaluations involving large language models, we demonstrate that our classifier has the potential to decrease the attack success rate resulting from adversarial attacks by up to 60%. this advancement paves the way for the next generation of more reliable and resilient conversational agents.
Jingjing Wang, Joshua Luo, Grace Yang, Allen Hong, Feng Luo
Abstract: large language models (llms), representing a significant achievement in artificial intelligence (ai) research, have demonstrated their ability in a multitude of tasks. this project aims to explore the capabilities of gpt-3.5, a leading example of llms, in processing the sentiment analysis of internet memes. memes, which include both verbal and visual aspects, act as a powerful yet complex tool for expressing ideas and sentiments, demanding an understanding of societal norms and cultural contexts. notably, the detection and moderation of hateful memes pose a significant challenge due to their implicit offensive nature. this project investigates gpt's proficiency in such subjective tasks, revealing its strengths and potential limitations. the tasks include the classification of meme sentiment, determination of humor type, and detection of implicit hate in memes. the performance evaluation, using datasets from semeval-2020 task 8 and facebook hateful memes, offers a comparative understanding of gpt responses against human annotations. despite gpt's remarkable progress, our findings underscore the challenges faced by these models in handling subjective tasks, which are rooted in their inherent limitations including contextual understanding, interpretation of implicit meanings, and data biases. this research contributes to the broader discourse on the applicability of ai in handling complex, context-dependent tasks, and offers valuable insights for future advancements.
Yuxiang Zhou, Jiazheng Li, Yanzheng Xiang, Hanqi Yan, Lin Gui, Yulan He
Abstract: understanding emergent abilities, such as in-context learning (icl) and chain-of-thought (cot) prompting in large language models (llms), is of utmost importance. this importance stems not only from the better utilization of these capabilities across various tasks, but also from the proactive identification and mitigation of potential risks, including concerns of truthfulness, bias, and toxicity, that may arise alongside these capabilities. in this paper, we present a thorough survey on the interpretation and analysis of emergent abilities of llms. first, we provide a concise introduction to the background and definition of emergent abilities. then, we give an overview of advancements from two perspectives: 1) a macro perspective, emphasizing studies on the mechanistic interpretability and delving into the mathematical foundations behind emergent abilities; and 2) a micro-perspective, concerning studies that focus on empirical interpretability by examining factors associated with these abilities. we conclude by highlighting the challenges encountered and suggesting potential avenues for future research. we believe that our work establishes the basis for further exploration into the interpretation of emergent abilities.
Xingchen Wu, Qin Qiu, Jiaqi Li, Yang Zhao
Abstract: with the rapid development of the internet, cyber security issues have become increasingly prominent. traditional cyber security defense methods are limited in the face of ever-changing threats, so it is critical to seek innovative attack surface generation methods. this study proposes intell-dragonfly, a cyber security attack surface generation engine based on artificial intelligence generation technology, to meet the challenges of cyber security. based on chatgpt technology, this paper designs an automated attack surface generation process, which can generate diversified and personalized attack scenarios, targets, elements and schemes. through experiments in a real network environment, the effect of the engine is verified and compared with traditional methods, which improves the authenticity and applicability of the attack surface. the experimental results show that the chatgpt-based method has significant advantages in the accuracy, diversity and operability of attack surface generation. furthermore, we explore the strengths and limitations of the engine and discuss its potential applications in the field of cyber security. this research provides a novel approach to the field of cyber security that is expected to have a positive impact on defense and prevention of cyberthreats.
Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, Xiangmin Xu
Abstract: large language models (llms) have been widely applied in various fields due to their excellent capability for memorizing knowledge and chain of thought (cot). when these language models are applied in the field of psychological counseling, they often rush to provide universal advice. however, when users seek psychological support, they need to gain empathy, trust, understanding and comfort, rather than just reasonable advice. to this end, we constructed a multi-turn empathetic conversation dataset of more than 2 million samples, in which the input is the multi-turn conversation context, and the target is empathetic responses that cover expressions such as questioning, comfort, recognition, listening, trust, emotional support, etc. experiments have shown that the empathy ability of llms can be significantly enhanced when finetuning by using multi-turn dialogue history and responses that are closer to the expression of a psychological consultant.

2023-10-30

Michael John Ilagan
Abstract: chatbots have the risk of generating offensive utterances, which must be avoided. post-deployment, one way for a chatbot to continuously improve is to source utterance/label pairs from feedback by live users. however, among users are trolls, who provide training examples with incorrect labels. to de-troll training data, previous work removed training examples that have high user-aggregated cross-validation (cv) error. however, cv is expensive; and in a coordinated attack, cv may be overwhelmed by trolls in number and in consistency among themselves. in the present work, i address both limitations by proposing a solution inspired by methodology in automated essay scoring (aes): have multiple users rate each utterance, then perform latent class analysis (lca) to infer correct labels. as it does not require gpu computations, lca is inexpensive. in experiments, i found that the aes-like solution can infer training labels with high accuracy when trolls are consistent, even when trolls are the majority.
Zhengliang Liu, Yiwei Li, Qian Cao, Junwen Chen, Tianze Yang, Zihao Wu, John Hale, John Gibbs, Khaled Rasheed, Ninghao Liu, Gengchen Mai, Tianming Liu
Abstract: recent advances in artificial general intelligence (agi), particularly large language models and creative image generation systems have demonstrated impressive capabilities on diverse tasks spanning the arts and humanities. however, the swift evolution of agi has also raised critical questions about its responsible deployment in these culturally significant domains traditionally seen as profoundly human. this paper provides a comprehensive analysis of the applications and implications of agi for text, graphics, audio, and video pertaining to arts and the humanities. we survey cutting-edge systems and their usage in areas ranging from poetry to history, marketing to film, and communication to classical art. we outline substantial concerns pertaining to factuality, toxicity, biases, and public safety in agi systems, and propose mitigation strategies. the paper argues for multi-stakeholder collaboration to ensure agi promotes creativity, knowledge, and cultural values without undermining truth or human dignity. our timely contribution summarizes a rapidly developing field, highlighting promising directions while advocating for responsible progress centering on human flourishing. the analysis lays the groundwork for further research on aligning agi's technological capacities with enduring social goods.
Leo Schwinn, David Dobre, Stephan Günnemann, Gauthier Gidel
Abstract: over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains vastly unsolved. here, one major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. flawed robustness evaluations necessitate rectifications in subsequent works, dangerously slowing down the research and providing a false sense of security. in this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source large language models (llms), such as chatgpt, google bard, or anthropic's claude. we provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the amount of faulty evaluations. additionally, we identify embedding space attacks on llms as another viable threat model for the purposes of generating malicious content in open-sourced models. finally, we demonstrate on a recently proposed defense that, without llm-specific best practices in place, it is easy to overestimate the robustness of a new approach.
Luis-Daniel Ibáñez, John Domingue, Sabrina Kirrane, Oshani Seneviratne, Aisling Third, Maria-Esther Vidal
Abstract: knowledge graphs (kgs) have emerged as fundamental platforms for powering intelligent decision-making and a wide range of artificial intelligence (ai) services across major corporations such as google, walmart, and airbnb. kgs complement machine learning (ml) algorithms by providing data context and semantics, thereby enabling further inference and question-answering capabilities. the integration of kgs with neuronal learning (e.g., large language models (llms)) is currently a topic of active research, commonly named neuro-symbolic ai. despite the numerous benefits that can be accomplished with kg-based ai, its growing ubiquity within online services may result in the loss of self-determination for citizens as a fundamental societal issue. the more we rely on these technologies, which are often centralised, the less citizens will be able to determine their own destinies. to counter this threat, ai regulation, such as the european union (eu) ai act, is being proposed in certain regions. the regulation sets what technologists need to do, leading to questions concerning: how can the output of ai systems be trusted? what is needed to ensure that the data fuelling and the inner workings of these artefacts are transparent? how can ai be made accountable for its decision-making? this paper conceptualises the foundational topics and research pillars to support kg-based ai for self-determination. drawing upon this conceptual framework, challenges and opportunities for citizen self-determination are illustrated and analysed in a real-world scenario. as a result, we propose a research agenda aimed at accomplishing the recommended objectives.
Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, Tobias Gerstenberg
Abstract: human commonsense understanding of the physical and social world is organized around intuitive theories. these theories support making causal and moral judgments. when something bad happens, we naturally ask: who did what, and why? a rich literature in cognitive science has studied people's causal and moral intuitions. this work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable. we collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. using this dataset, we test whether large language models (llms) make causal and moral judgments about text-based scenarios that align with those of human participants. on the aggregate level, alignment has improved with more recent llms. however, using statistical analyses, we find that llms weigh the different factors quite differently from human participants. these results show how curated, challenge datasets combined with insights from cognitive science can help us go beyond comparisons based merely on aggregate metrics: we uncover llms implicit tendencies and show to what extent these align with human intuitions.
Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, N/A Supryadi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong
Abstract: large language models (llms) have demonstrated remarkable capabilities across a broad spectrum of tasks. they have attracted significant attention and been deployed in numerous downstream applications. nevertheless, akin to a double-edged sword, llms also present potential risks. they could suffer from private data leaks or yield inappropriate, harmful, or misleading content. additionally, the rapid progress of llms raises concerns about the potential emergence of superintelligent systems without adequate safeguards. to effectively capitalize on llm capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of llms. this survey endeavors to offer a panoramic perspective on the evaluation of llms. we categorize the evaluation of llms into three major groups: knowledge and capability evaluation, alignment evaluation and safety evaluation. in addition to the comprehensive review on the evaluation methodologies and benchmarks on these three aspects, we collate a compendium of evaluations pertaining to llms' performance in specialized domains, and discuss the construction of comprehensive evaluation platforms that cover llm evaluations on capabilities, alignment, safety, and applicability. we hope that this comprehensive overview will stimulate further research interests in the evaluation of llms, with the ultimate goal of making evaluation serve as a cornerstone in guiding the responsible development of llms. we envision that this will channel their evolution into a direction that maximizes societal benefit while minimizing potential risks. a curated list of related papers has been publicly available at https://github.com/tjunlp-lab/awesome-llms-evaluation-papers.
Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, "Aidan O'Gara", Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen Mcaleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, Wen Gao
Abstract: ai alignment aims to build ai systems that are in accordance with human intentions and values. with the emergence of ai systems possessing superhuman capabilities, the potential large-scale risks associated with misaligned systems become apparent. hundreds of ai experts and public figures have expressed their concerns about ai risks, arguing that mitigating the risk of extinction from ai should be a global priority, alongside other societal-scale risks such as pandemics and nuclear war. motivated by the lack of an up-to-date systematic survey on ai alignment, in this paper, we delve into the core concepts, methodology, and practice of alignment research. to begin with, we identify four principles as the key objectives of ai alignment: robustness, interpretability, controllability, and ethicality (rice). we outline the landscape of current alignment research and decompose them into two key components: forward alignment and backward alignment. the former aims to make ai systems aligned via alignment training, while the latter aims to gain evidence about the systems' alignment and govern them appropriately to avoid exacerbating misalignment risks. on forward alignment, we discuss how to conduct learning from various types of feedback (a.k.a., outer alignment) and how to overcome the distribution shift to avoid goal misgeneralization (a.k.a., inner alignment). on backward alignment, we discuss verification techniques that can tell the degree of value alignment for various ai systems deployed, which can further improve the assurance of forward alignment outcomes. based on this, we also release a constantly updated website featuring tutorials, collections of papers, blogs, and other learning resources at https://www.alignmentsurvey.com.
Prakamya Mishra, Zonghai Yao, Shuwei Chen, Beining Wang, Rohan Mittal, Hong Yu
Abstract: large language models (llms) like the gpt and llama families have demonstrated exceptional capabilities in capturing and condensing critical contextual information and achieving state-of-the-art performance in the summarization task. however, community concerns about these models' hallucination issues continue to rise. llms sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain nlp tasks (e.g., clinical note summarization), where factually incorrect statements can lead to critically erroneous diagnoses. fine-tuning llms using human feedback has shown the promise of aligning llms to be factually consistent during generation, but such training procedure requires high-quality human-annotated data, which can be extremely expensive to get in the clinical domain. in this work, we propose a new pipeline using chatgpt instead of human experts to generate high-quality feedback data for improving factual consistency in the clinical note summarization task. we focus specifically on edit feedback because recent work discusses the shortcomings of human alignment via preference feedback in complex situations (such as clinical nlp tasks that require extensive expert knowledge), as well as some advantages of collecting edit feedback from domain experts. in addition, although gpt has reached the expert level in many clinical nlp tasks (e.g., usmle qa), there is not much previous work discussing whether gpt can generate expert-level edit feedback for lms in the clinical note summarization task. we hope to fill this gap. finally, our evaluations demonstrate the potential use of gpt edits in human alignment, especially from a factuality perspective.
Sunayana Rane, Mark Ho, Ilia Sucholutsky, Thomas L. Griffiths
Abstract: value alignment is essential for building ai systems that can safely and reliably interact with people. however, what a person values -- and is even capable of valuing -- depends on the concepts that they are currently using to understand and evaluate what happens in the world. the dependence of values on concepts means that concept alignment is a prerequisite for value alignment -- agents need to align their representation of a situation with that of humans in order to successfully align their values. here, we formally analyze the concept alignment problem in the inverse reinforcement learning setting, show how neglecting concept alignment can lead to systematic value mis-alignment, and describe an approach that helps minimize such failure modes by jointly reasoning about a person's concepts and values. additionally, we report experimental results with human participants showing that humans reason about the concepts used by an agent when acting intentionally, in line with our joint reasoning model.
Jiaao Chen, Diyi Yang
Abstract: large language models (llms) have achieved significant progress from pre-training on and memorizing a wide range of textual data, however, this process might suffer from privacy issues and violations of data protection regulations. as a result, the ability to easily remove data related to individual users from such models while not deteriorating their predictive quality after the removal becomes increasingly important. to address these issues, in this work, we propose an efficient unlearning framework that could efficiently update llms without having to retrain the whole model after data removals, by introducing lightweight unlearning layers learned with a selective teacher-student objective into the transformers. in addition, we introduce a fusion mechanism to effectively combine different unlearning layers that learns to forget different sets of data to handle a sequence of forgetting operations. experiments on classification and generation tasks demonstrate the effectiveness of our proposed methods compared to the state-of-the-art baselines.

2023-10-29

Tomasz Limisiewicz, David Mareček, Tomáš Musil
Abstract: large language models are becoming the go-to solution for various language tasks. however, with growing capacity, models are prone to rely on spurious correlations stemming from biases and stereotypes present in the training data. this work proposes a novel method for detecting and mitigating gender bias in language models. we perform causal analysis to identify problematic model components and discover that mid-upper feed-forward layers are most prone to convey biases. based on the analysis results, we adapt the model by multiplying these layers by a linear projection. our titular method, dama, significantly decreases bias as measured by diverse metrics while maintaining the model's performance on downstream tasks. we release code for our method and models, which retrain llama's state-of-the-art performance while being significantly less biased.
Ahmad Nasir, Aadish Sharma, Kokil Jaidka
Abstract: this paper compares different pre-trained and fine-tuned large language models (llms) for hate speech detection. our research underscores challenges in llms' cross-domain validity and overfitting risks. through evaluations, we highlight the need for fine-tuned models that grasp the nuances of hate speech through greater label heterogeneity. we conclude with a vision for the future of hate speech detection, emphasizing cross-domain generalizability and appropriate benchmarking practices.
Noah Thomas Mcdermott, Junfeng Yang, Chengzhi Mao
Abstract: large-scale language models achieved state-of-the-art performance over a number of language tasks. however, they fail on adversarial language examples, which are sentences optimized to fool the language models but with similar semantic meanings for humans. while prior work focuses on making the language model robust at training time, retraining for robustness is often unrealistic for large-scale foundation models. instead, we propose to make the language models robust at test time. by dynamically adapting the input sentence with predictions from masked words, we show that we can reverse many language adversarial attacks. since our approach does not require any training, it works for novel tasks at test time and can adapt to novel adversarial corruptions. visualizations and empirical results on two popular sentence classification datasets demonstrate that our method can repair adversarial language attacks over 65% o
Sayak Saha Roy, Poojitha Thota, Krishna Vamsi Naragam, Shirin Nilizadeh
Abstract: the advanced capabilities of large language models (llms) have made them invaluable across various applications, from conversational agents and content creation to data analysis, research, and innovation. however, their effectiveness and accessibility also render them susceptible to abuse for generating malicious content, including phishing attacks. this study explores the potential of using four popular commercially available llms - chatgpt (gpt 3.5 turbo), gpt 4, claude and bard to generate functional phishing attacks using a series of malicious prompts. we discover that these llms can generate both phishing emails and websites that can convincingly imitate well-known brands, and also deploy a range of evasive tactics for the latter to elude detection mechanisms employed by anti-phishing systems. notably, these attacks can be generated using unmodified, or "vanilla," versions of these llms, without requiring any prior adversarial exploits such as jailbreaking. as a countermeasure, we build a bert based automated detection tool that can be used for the early detection of malicious prompts to prevent llms from generating phishing content attaining an accuracy of 97\% for phishing website prompts, and 94\% for phishing email prompts.
Xin Liu, Muhammad Khalifa, Lu Wang
Abstract: a model is considered well-calibrated when its probability estimate aligns with the actual likelihood of the output being correct. calibrating language models (lms) is crucial, as it plays a vital role in detecting and mitigating hallucinations, a common issue of lms, as well as building more trustworthy models. yet, popular neural model calibration techniques are not well-suited for lms due to their lack of flexibility in discerning answer correctness and their high computational costs. for instance, post-processing methods like temperature scaling are often unable to reorder the candidate generations. moreover, training-based methods require finetuning the entire model, which is impractical due to the increasing sizes of modern lms. in this paper, we present litcab, a lightweight calibration mechanism consisting of a single linear layer taking the input text representation and manipulateing the lm output logits. litcab improves model calibration by only adding < 2% of the original model parameters. for evaluation, we construct cat, a benchmark consisting of 7 text generation tasks, covering responses ranging from short phrases to paragraphs. we test litcab with llama2-7b, where it improves calibration across all tasks, by reducing the average ece score by 20%. we further conduct a comprehensive evaluation with 7 popular open-sourced lms from gpt and llama families, yielding the following key findings: (1) larger models within the same family exhibit better calibration on tasks with short generation tasks, but not necessarily for longer ones. (2) gpt-family models show superior calibration compared to llama, llama2 and vicuna models despite having much fewer parameters. (3) finetuning pretrained model (e.g., llama) with samples of limited purpose (e.g., conversations) may lead to worse calibration, highlighting the importance of finetuning setups for calibrating lms.
Dhawal Gupta, Yash Chandak, Scott M. Jordan, Philip S. Thomas, Bruno Castro Da Silva
Abstract: designing reward functions for efficiently guiding reinforcement learning (rl) agents toward specific behaviors is a complex task. this is challenging since it requires the identification of reward structures that are not sparse and that avoid inadvertently inducing undesirable behaviors. naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer's intended goal. although potential-based reward shaping is often suggested as a remedy, we systematically investigate settings where deploying it often significantly impairs performance. to address these issues, we introduce a new framework that uses a bi-level objective to learn \emph{behavior alignment reward functions}. these functions integrate auxiliary rewards reflecting a designer's heuristics and domain knowledge with the environment's primary rewards. our approach automatically determines the most effective way to blend these types of feedback, thereby enhancing robustness against heuristic reward misspecification. remarkably, it can also adapt an agent's policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying rl algorithms. we evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. we investigate heuristic auxiliary rewards of varying quality -- some of which are beneficial and others detrimental to the learning process. our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. it not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly-specified auxiliary reward functions.

2023-10-28

Wencong You, Zayd Hammoudeh, Daniel Lowd
Abstract: backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data. we focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled. our attack, llmbkd, leverages language models to automatically insert diverse style-based triggers into texts. we also propose a poison selection technique to improve the effectiveness of both llmbkd as well as existing textual backdoor attacks. lastly, we describe react, a baseline defense to mitigate backdoor attacks via antidote training examples. our evaluations demonstrate llmbkd's effectiveness and efficiency, where we consistently achieve high attack success rates across a wide range of styles with little effort and no model training.
Ruixiang Tang, Jiayi Yuan, Yiming Li, Zirui Liu, Rui Chen, Xia Hu
Abstract: in the field of natural language processing, the prevalent approach involves fine-tuning pretrained language models (plms) using local samples. recent research has exposed the susceptibility of plms to backdoor attacks, wherein the adversaries can embed malicious prediction behaviors by manipulating a few training samples. in this study, our objective is to develop a backdoor-resistant tuning procedure that yields a backdoor-free model, no matter whether the fine-tuning dataset contains poisoned samples. to this end, we propose and integrate a honeypot module into the original plm, specifically designed to absorb backdoor information exclusively. our design is motivated by the observation that lower-layer representations in plms carry sufficient backdoor features while carrying minimal information about the original tasks. consequently, we can impose penalties on the information acquired by the honeypot module to inhibit backdoor creation during the fine-tuning process of the stem network. comprehensive experiments conducted on benchmark datasets substantiate the effectiveness and robustness of our defensive strategy. notably, these results indicate a substantial reduction in the attack success rate ranging from 10\% to 40\% when compared to prior state-of-the-art methods.
Sajad Mousavi, Ricardo Luna Gutiérrez, Desik Rengarajan, Vineet Gundecha, Ashwin Ramesh Babu, Avisek Naug, Antonio Guillen, Soumyendu Sarkar
Abstract: we propose a self-correction mechanism for large language models (llms) to mitigate issues such as toxicity and fact hallucination. this method involves refining model outputs through an ensemble of critics and the model's own feedback. drawing inspiration from human behavior, we explore whether llms can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. we consistently observe performance improvements in llms for reducing toxicity and correcting factual errors.
Domenico Cotroneo, Alessio Foggia, Cristina Improta, Pietro Liguori, Roberto Natella
Abstract: in this paper, we propose a fully automated method, named acca, to evaluate the correctness of ai-generated code for security purposes. the method uses symbolic execution to assess whether the ai-generated code behaves as a reference implementation. we use acca to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results of the evaluation with different baseline solutions, including output similarity metrics, widely used in the field, and the well-known chatgpt, the ai-powered language model developed by openai. our experiments show that our method outperforms the baseline solutions and assesses the correctness of the ai-generated code similar to the human-based evaluation, which is considered the ground truth for the assessment in the field. moreover, acca has a very strong correlation with human evaluation (pearson's correlation coefficient r=0.84 on average). finally, since it is a fully automated solution that does not require any human intervention, the proposed method performs the assessment of every code snippet in ~0.17s on average, which is definitely lower than the average time required by human analysts to manually inspect the code, based on our experience.

2023-10-27

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, Yejin Choi
Abstract: the interactive use of large language models (llms) in ai assistants (at work, home, etc.) introduces a new set of inference-time privacy risks: llms are fed different types of information from multiple sources in their inputs and are expected to reason about what to share in their outputs, for what purpose and with whom, within a given context. in this work, we draw attention to the highly critical yet overlooked notion of contextual privacy by proposing confaide, a benchmark designed to identify critical weaknesses in the privacy reasoning capabilities of instruction-tuned llms. our experiments show that even the most capable models such as gpt-4 and chatgpt reveal private information in contexts that humans would not, 39% and 57% of the time, respectively. this leakage persists even when we employ privacy-inducing prompts or chain-of-thought reasoning. our work underscores the immediate need to explore novel inference-time privacy-preserving approaches, based on reasoning and theory of mind.
David Q. Sun, Artem Abzaliev, Hadas Kotek, Zidi Xiu, Christopher Klein, Jason D. Williams
Abstract: controversy is a reflection of our zeitgeist, and an important aspect to any discourse. the rise of large language models (llms) as conversational systems has increased public reliance on these systems for answers to their various questions. consequently, it is crucial to systematically examine how these models respond to questions that pertaining to ongoing debates. however, few such datasets exist in providing human-annotated labels reflecting the contemporary discussions. to foster research in this area, we propose a novel construction of a controversial questions dataset, expanding upon the publicly released quora question pairs dataset. this dataset presents challenges concerning knowledge recency, safety, fairness, and bias. we evaluate different llms using a subset of this dataset, illuminating how they handle controversial issues and the stances they adopt. this research ultimately contributes to our understanding of llms' interaction with controversial issues, paving the way for improvements in their comprehension and handling of complex societal debates.
Nitish Joishi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
Abstract: large language models are trained on vast amounts of text from the internet, which contains both factual and misleading information about the world. can language models discern truth from falsehood in this contradicting data? expanding on the view that llms can model different agents producing the corpora, we hypothesize that they can cluster truthful text by modeling a truthful persona: a group of agents that are likely to produce truthful text and share similar features. for example, trustworthy sources like wikipedia and science usually use formal writing styles and make consistent claims. by modeling this persona, llms can generalize truthfulness beyond the specific contexts in which each agent generated the training text. for example, the model can infer that the agent "wikipedia" will behave truthfully on topics that were only generated by "science" because they share a persona. we first show evidence for the persona hypothesis via two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts improves its truthfulness on unseen topics. next, using arithmetics as a synthetic environment, we show that language models can separate true and false statements, and generalize truthfulness across agents; but only if agents in the training data share a truthful generative process that enables the creation of a truthful persona. overall, our findings suggest that models can exploit hierarchical structures in the data to learn abstract concepts like truthfulness.
Rose Hadshar
Abstract: rapid advancements in artificial intelligence (ai) have sparked growing concerns among experts, policymakers, and world leaders regarding the potential for increasingly advanced ai systems to pose existential risks. this paper reviews the evidence for existential risks from ai via misalignment, where ai systems develop goals misaligned with human values, and power-seeking, where misaligned ais actively seek power. the review examines empirical findings, conceptual arguments and expert opinion relating to specification gaming, goal misgeneralization, and power-seeking. the current state of the evidence is found to be concerning but inconclusive regarding the existence of extreme forms of misaligned power-seeking. strong empirical evidence of specification gaming combined with strong conceptual evidence for power-seeking make it difficult to dismiss the possibility of existential risk from misaligned power-seeking. on the other hand, to date there are no public empirical examples of misaligned power-seeking in ai systems, and so arguments that future systems will pose an existential risk remain somewhat speculative. given the current state of the evidence, it is hard to be extremely confident either that misaligned power-seeking poses a large existential risk, or that it poses no existential risk. the fact that we cannot confidently rule out existential risk from ai via misaligned power-seeking is cause for serious concern.
Chloe Qinyu Zhu, Rickard Stureborg, Brandon Fain
Abstract: language representation models (lrms) trained with real-world data may capture and exacerbate undesired bias and cause unfair treatment of people in various demographic groups. several techniques have been investigated for applying interventions to lrms to remove bias in benchmark evaluations on, for example, word embeddings. however, the negative side effects of debiasing interventions are usually not revealed in the downstream tasks. we propose xgap-debias, a set of evaluations on assessing the fairness of debiasing. in this work, we examine four debiasing techniques on a real-world text classification task and show that reducing biasing is at the cost of degrading performance for all demographic groups, including those the debiasing techniques aim to protect. we advocate that a debiasing technique should have good downstream performance with the constraint of ensuring no harm to the protected group.
Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang
Abstract: we construct the first provable watermarking scheme for language models with public detectability or verifiability: we use a private key for watermarking and a public key for watermark detection. our protocol is the first watermarking scheme that does not embed a statistical signal in generated text. rather, we directly embed a publicly-verifiable cryptographic signature using a form of rejection sampling. we show that our construction meets strong formal security guarantees and preserves many desirable properties found in schemes in the private-key watermarking setting. in particular, our watermarking scheme retains distortion-freeness and model agnosticity. we implement our scheme and make empirical measurements over open models in the 7b parameter range. our experiments suggest that our watermarking scheme meets our formal claims while preserving text quality.
Fabien Roger, Ryan Greenblatt
Abstract: large language models (llms) often benefit from intermediate steps of reasoning to generate answers to complex problems. when these intermediate steps of reasoning are used to monitor the activity of the model, it is essential that this explicit reasoning is faithful, i.e. that it reflects what the model is actually reasoning about. in this work, we focus on one potential way intermediate steps of reasoning could be unfaithful: encoded reasoning, where an llm could encode intermediate steps of reasoning in the generated text in a way that is not understandable to human readers. we show that language models can be trained to make use of encoded reasoning to get higher performance without the user understanding the intermediate steps of reasoning. we argue that, as language models get stronger, this behavior becomes more likely to appear naturally. finally, we describe a methodology that enables the evaluation of defenses against encoded reasoning, and show that, under the right conditions, paraphrasing successfully prevents even the best encoding schemes we built from encoding more than 3 bits of information per kb of text.

2023-10-26

Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang
Abstract: despite remarkable advances that large language models have achieved in chatbots, maintaining a non-toxic user-ai interactive environment has become increasingly critical nowadays. however, previous efforts in toxicity detection have been mostly based on benchmarks derived from social media content, leaving the unique challenges inherent to real-world user-ai interactions insufficiently explored. in this work, we introduce toxicchat, a novel benchmark based on real user queries from an open-source chatbot. this benchmark contains the rich, nuanced phenomena that can be tricky for current toxicity detection models to identify, revealing a significant domain difference compared to social media content. our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of toxicchat. our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-ai conversations. in the future, toxicchat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-ai interactions.
Rishav Hada, Agrima Seth, Harshita Diddee, Kalika Bali
Abstract: language serves as a powerful tool for the manifestation of societal belief systems. in doing so, it also perpetuates the prevalent biases in our society. gender bias is one of the most pervasive biases in our society and is seen in online and offline discourses. with llms increasingly gaining human-like fluency in text generation, gaining a nuanced understanding of the biases these systems can generate is imperative. prior work often treats gender bias as a binary classification task. however, acknowledging that bias must be perceived at a relative scale; we investigate the generation and consequent receptivity of manual annotators to bias of varying degrees. specifically, we create the first dataset of gpt-generated english text with normative ratings of gender bias. ratings were obtained using best--worst scaling -- an efficient comparative annotation framework. next, we systematically analyze the variation of themes of gender biases in the observed ranking and show that identity-attack is most closely related to gender bias. finally, we show the performance of existing automated models trained on related concepts on our dataset.
Xiaoyuan Yi, Jing Yao, Xiting Wang, Xing Xie
Abstract: big models have greatly advanced ai's ability to understand, generate, and manipulate information and content, enabling numerous applications. however, as these models become increasingly integrated into everyday life, their inherent ethical values and potential biases pose unforeseen risks to society. this paper provides an overview of the risks and challenges associated with big models, surveys existing ai ethics guidelines, and examines the ethical implications arising from the limitations of these models. taking a normative ethics perspective, we propose a reassessment of recent normative guidelines, highlighting the importance of collaborative efforts in academia to establish a unified and universal ai ethics framework. furthermore, we investigate the moral inclinations of current mainstream llms using the moral foundation theory, analyze existing alignment algorithms, and outline the unique challenges encountered in aligning ethical values within them. to address these challenges, we introduce a novel conceptual paradigm for aligning the ethical values of big models and discuss promising research directions for alignment criteria, evaluation, and method, representing an initial step towards the interdisciplinary construction of the ethically aligned ai this paper is a modified english version of our chinese paper https://crad.ict.ac.cn/cn/article/doi/10.7544/issn1000-1239.202330553, intended to help non-chinese native speakers better understand our work.
Anjishnu Mukherjee, Chahat Raj, Ziwei Zhu, Antonios Anastasopoulos
Abstract: human biases are ubiquitous but not uniform: disparities exist across linguistic, cultural, and societal borders. as large amounts of recent literature suggest, language models (lms) trained on human data can reflect and often amplify the effects of these social biases. however, the vast majority of existing studies on bias are heavily skewed towards western and european languages. in this work, we scale the word embedding association test (weat) to 24 languages, enabling broader studies and yielding interesting findings about lm bias. we additionally enhance this data with culturally relevant information for each language, capturing local contexts on a global scale. further, to encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more. moreover, we delve deeper into the indian linguistic landscape, conducting a comprehensive regional bias analysis across six prevalent indian languages. finally, we highlight the significance of these social biases and the new dimensions through an extensive comparison of embedding methods, reinforcing the need to address them in pursuit of more equitable language models. all code, data and results are available here: https://github.com/iamshnoo/weathub.
Laura Cabello, Emanuele Bugliarello, Stephanie Brandl, Desmond Elliott
Abstract: pretrained machine learning models are known to perpetuate and even amplify existing biases in data, which can result in unfair outcomes that ultimately impact user experience. therefore, it is crucial to understand the mechanisms behind those prejudicial biases to ensure that model performance does not result in discriminatory behaviour toward certain groups or populations. in this work, we define gender bias as our case study. we quantify bias amplification in pretraining and after fine-tuning on three families of vision-and-language models. we investigate the connection, if any, between the two learning stages, and evaluate how bias amplification reflects on model performance. overall, we find that bias amplification in pretraining and after fine-tuning are independent. we then examine the effect of continued pretraining on gender-neutral data, finding that this reduces group disparities, i.e., promotes fairness, on vqav2 and retrieval tasks without significantly compromising task performance.
Anaelia Ovalle
Abstract: educational disparities within the dominican republic (dr) have long-standing origins rooted in economic, political, and social inequity. addressing these challenges has necessarily called for capacity building with respect to educational materials, high-quality instruction, and structural resourcing. generative ai tools like chatgpt have begun to pique the interest of dominican educators due to their perceived potential to bridge these educational gaps. however, a substantial body of ai fairness literature has documented ways ai disproportionately reinforces power dynamics reflective of jurisdictions driving ai development and deployment policies, collectively termed the ai global north. as such, indiscriminate adoption of this technology for dr education, even in part, risks perpetuating forms of digital coloniality. therefore, this paper centers embracing ai-facilitated educational reform by critically examining how ai-driven tools like chatgpt in dr education may replicate facets of digital colonialism. we provide a concise overview of 20th-century dominican education reforms following the 1916 us occupation. then, we employ identified neocolonial aspects historically shaping dominican education to interrogate the perceived advantages of chatgpt for contemporary dominican education, as outlined by a dominican scholar. this work invites ai global north & south developers, stakeholders, and dominican leaders alike to exercise a relational contextualization of data-centric epistemologies like chatgpt to reap its transformative benefits while remaining vigilant of safeguarding dominican digital sovereignty.
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila Mcilraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann
Abstract: in this short consensus paper, we outline risks from upcoming, advanced ai systems. we examine large-scale social harms and malicious uses, as well as an irreversible loss of human control over autonomous ai systems. in light of rapid and continuing ai progress, we propose priorities for ai r&d and governance.
Yi-Li Hsu, Shih-Chieh Dai, Aiping Xiong, Lun-Wei Ku
Abstract: with advancements in natural language processing (nlp) models, automatic explanation generation has been proposed to mitigate misinformation on social media platforms in addition to adding warning labels to identified fake news. while many researchers have focused on generating good explanations, how these explanations can really help humans combat fake news is under-explored. in this study, we compare the effectiveness of a warning label and the state-of-the-art counterfactual explanations generated by gpt-4 in debunking misinformation. in a two-wave, online human-subject study, participants (n = 215) were randomly assigned to a control group in which false contents are shown without any intervention, a warning tag group in which the false claims were labeled, or an explanation group in which the false contents were accompanied by gpt-4 generated explanations. our results show that both interventions significantly decrease participants' self-reported belief in fake claims in an equivalent manner for the short-term and long-term. we discuss the implications of our findings and directions for future nlp-based misinformation debunking strategies.
Ahmed Magooda, Alec Helyar, Kyle Jackson, David Sullivan, Chad Atalla, Emily Sheng, Dan Vann, Richard Edgar, Hamid Palangi, Roman Lutz, Hongliang Kong, Vincent Yun, Eslam Kamal, Federico Zarfati, Hanna Wallach, Sarah Bird, Mei Chen
Abstract: we present a framework for the automated measurement of responsible ai (rai) metrics for large language models (llms) and associated products and services. our framework for automatically measuring harms from llms builds on existing technical and sociotechnical expertise and leverages the capabilities of state-of-the-art llms, such as gpt-4. we use this framework to run through several case studies investigating how different llms may violate a range of rai-related principles. the framework may be employed alongside domain-specific sociotechnical expertise to create measurements for new harm areas in the future. by implementing this framework, we aim to enable more advanced harm measurement efforts and further the responsible use of llms.
Jan-Philipp Fränken, Sam Kwok, Peixuan Ye, Kanishk Gandhi, Dilip Arumugam, Jared Moore, Alex Tamkin, Tobias Gerstenberg, Noah D. Goodman
Abstract: we explore the idea of aligning an ai assistant by inverting a model of users' (unknown) preferences from observed interactions. to validate our proposal, we run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players. we find that the ai assistant accurately aligns its behavior to match standard policies from the economic literature (e.g., selfish, altruistic). however, the assistant's learned policies lack robustness and exhibit limited generalization in an out-of-distribution setting when confronted with a currency (e.g., grams of medicine) that was not included in the assistant's training distribution. additionally, we find that when there is inconsistency in the relationship between language use and an unknown policy (e.g., an altruistic policy combined with rude language), the assistant's learning of the policy is slowed. overall, our preliminary results suggest that developing simulation frameworks in which ai assistants need to infer preferences from diverse users can provide a valuable approach for studying practical alignment questions.

2023-10-25

Mingfeng Xue, Dayiheng Liu, Kexin Yang, Guanting Dong, Wenqiang Lei, Zheng Yuan, Chang Zhou, Jingren Zhou
Abstract: the emergence of large language models (llms) has revolutionized natural language processing tasks. however, existing instruction-tuning datasets suffer from occupational bias: the majority of data relates to only a few occupations, which hampers the instruction-tuned llms to generate helpful responses to professional queries from practitioners in specific fields. to mitigate this issue and promote occupation-inclusive llms, we create an instruction-tuning dataset named \emph{occuquest}, which contains 110,000+ prompt-completion pairs and 30,000+ dialogues covering over 1,000 occupations in 26 occupational categories. we systematically request chatgpt, organizing queries hierarchically based on occupation, responsibility, topic, and question, to ensure a comprehensive coverage of occupational specialty inquiries. by comparing with three commonly used datasets (dolly, sharegpt, and wizardlm), we observe that occuquest exhibits a more balanced distribution across occupations. furthermore, we assemble three test sets for comprehensive evaluation, an occu-test set covering 25 occupational categories, an estate set focusing on real estate, and an occu-quora set containing real-world questions from quora. we then fine-tune llama on occuquest to obtain occullama, which significantly outperforms state-of-the-art llama variants (vicuna, tulu, and wizardlm) on professional questions in gpt-4 and human evaluations. notably, on the occu-quora set, occullama reaches a high win rate of 86.4\% against wizardlm.
Amanda Ferrari Iaquinta, Gustavo Voltani Von Atzingen
Abstract: the large language based-model chatbot chatgpt gained a lot of popularity since its launch and has been used in a wide range of situations. this research centers around a particular situation, when the chatgpt is used to produce news that will be consumed by the population, causing the facilitation in the production of fake news, spread of misinformation and lack of trust in news sources. aware of these problems, this research aims to build an artificial intelligence model capable of performing authorship attribution on news articles, identifying the ones written by the chatgpt. to achieve this goal, a dataset containing equal amounts of human and chatgpt written news was assembled and different natural processing language techniques were used to extract features from it that were used to train, validate and test three models built with different techniques. the best performance was produced by the bidirectional long short term memory (lstm) neural network model, achiving 91.57\% accuracy when tested against the data from the testing set.
Ronald Schnitzer, Andreas Hapfelmeier, Sven Gaube, Sonja Zillner
Abstract: recent advancements in the field of artificial intelligence (ai) establish the basis to address challenging tasks. however, with the integration of ai, new risks arise. therefore, to benefit from its advantages, it is essential to adequately handle the risks associated with ai. existing risk management processes in related fields, such as software systems, need to sufficiently consider the specifics of ai. a key challenge is to systematically and transparently identify and address ai risks' root causes - also called ai hazards. this paper introduces the ai hazard management (aihm) framework, which provides a structured process to systematically identify, assess, and treat ai hazards. the proposed process is conducted in parallel with the development to ensure that any ai hazard is captured at the earliest possible stage of the ai system's life cycle. in addition, to ensure the ai system's auditability, the proposed framework systematically documents evidence that the potential impact of identified ai hazards could be reduced to a tolerable level. the framework builds upon an ai hazard list from a comprehensive state-of-the-art analysis. also, we provide a taxonomy that supports the optimal treatment of the identified ai hazards. additionally, we illustrate how the aihm framework can increase the overall quality of a power grid ai use case by systematically reducing the impact of identified hazards to an acceptable level.
Gabriel Mukobi, Peter Chatain, Su Fong, Robert Windesheim, Gitta Kutyniok, Kush Bhatia, Silas Alberti
Abstract: while large language models demonstrate remarkable capabilities, they often present challenges in terms of safety, alignment with human values, and stability during training. here, we focus on two prevalent methods used to align these models, supervised fine-tuning (sft) and reinforcement learning from human feedback (rlhf). sft is simple and robust, powering a host of open-source models, while rlhf is a more sophisticated method used in top-tier models like chatgpt but also suffers from instability and susceptibility to reward hacking. we propose a novel approach, supervised iterative learning from human feedback (superhf), which seeks to leverage the strengths of both methods. our hypothesis is two-fold: that the reward model used in rlhf is critical for efficient data use and model generalization and that the use of proximal policy optimization (ppo) in rlhf may not be necessary and could contribute to instability issues. superhf replaces ppo with a simple supervised loss and a kullback-leibler (kl) divergence prior. it creates its own training data by repeatedly sampling a batch of model outputs and filtering them through the reward model in an online learning regime. we then break down the reward optimization problem into three components: robustly optimizing the training rewards themselves, preventing reward hacking-exploitation of the reward model that degrades model performance-as measured by a novel meteor similarity metric, and maintaining good performance on downstream evaluations. our experimental results show superhf exceeds ppo-based rlhf on the training objective, easily and favorably trades off high reward with low reward hacking, improves downstream calibration, and performs the same on our gpt-4 based qualitative evaluation scheme all the while being significantly simpler to implement, highlighting superhf's potential as a competitive language model alignment technique.
Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, N/A Xinyi, N/A Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Deb Roy, Sara Hooker
Abstract: the race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. to remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. we develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. this points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. we also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 72%+ and error rates of 50%+. this points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. as a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive ui, the data provenance explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: www.dataprovenance.org.
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro Von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf
Abstract: we aim to produce a smaller language model that is aligned to user intent. previous research has shown that applying distilled supervised fine-tuning (dsft) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. to distill this property, we experiment with the use of preference data from ai feedback (aif). starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (ddpo) to learn a chat model with significantly improved intent alignment. the approach requires only a few hours of training without any additional sampling during fine-tuning. the final result, zephyr-7b, sets the state-of-the-art on chat benchmarks for 7b parameter models, and requires no human annotation. in particular, results on mt-bench show that zephyr-7b surpasses llama2-chat-70b, the best open-access rlhf-based model. code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.
Ananth Balashankar, Xiao Ma, Aradhana Sinha, Ahmad Beirami, Yao Qin, Jilin Chen, Alex Beutel
Abstract: as large language models (llms) are widely adopted, new safety issues and policies emerge, to which existing safety classifiers do not generalize well. if we have only observed a few examples of violations of a new safety rule, how can we build a classifier to detect violations? in this paper, we study the novel setting of domain-generalized few-shot learning for llm-based text safety classifiers. unlike prior few-shot work, these new safety issues can be hard to uncover and we do not get to choose the few examples. we demonstrate that existing few-shot techniques do not perform well in this setting, and rather we propose to do parameter-efficient fine-tuning (peft) combined with augmenting training data based on similar examples in prior existing rules. we empirically show that our approach of similarity-based data-augmentation + prompt-tuning (dapt) consistently outperforms baselines that either do not rely on data augmentation or on peft by 7-17% f1 score in the social chemistry moral judgement and 9-13% auc in the toxicity detection tasks, even when the new rule is loosely correlated with existing ones.
Fan Wu, Huseyin A. Inan, Arturs Backurs, Varun Chandrasekaran, Janardhan Kulkarni, Robert Sim
Abstract: positioned between pre-training and user deployment, aligning large language models (llms) through reinforcement learning (rl) has emerged as a prevailing strategy for training instruction following-models such as chatgpt. in this work, we initiate the study of privacy-preserving alignment of llms through differential privacy (dp) in conjunction with rl. following the influential work of ziegler et al. (2020), we study two dominant paradigms: (i) alignment via rl without human in the loop (e.g., positive review generation) and (ii) alignment via rl from human feedback (rlhf) (e.g., summarization in a human-preferred way). we give a new dp framework to achieve alignment via rl, and prove its correctness. our experimental results validate the effectiveness of our approach, offering competitive utility while ensuring strong privacy protections.
Sidharth Mudgal, Jong Lee, Harish Ganapathy, Yaguang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami
Abstract: we propose controlled decoding (cd), a novel off-policy reinforcement learning method to control the autoregressive generation from language models towards high reward outcomes. cd solves an off-policy reinforcement learning problem through a value function for the reward, which we call a prefix scorer. the prefix scorer is used at inference time to steer the generation towards higher reward outcomes. we show that the prefix scorer may be trained on (possibly) off-policy data to predict the expected reward when decoding is continued from a partially decoded response. we empirically demonstrate that cd is effective as a control mechanism on reddit conversations corpus. we also show that the modularity of the design of cd makes it possible to control for multiple rewards, effectively solving a multi-objective reinforcement learning problem with no additional complexity. finally, we show that cd can be applied in a novel blockwise fashion at inference-time, again without the need for any training-time changes, essentially bridging the gap between the popular best-of-$k$ strategy and token-level reinforcement learning. this makes cd a promising approach for alignment of language models.
Farima Fatahi Bayat, Kun Qian, Benjamin Han, Yisi Sang, Anton Belyi, Samira Khorshidi, Fei Wu, Ihab F. Ilyas, Yunyao Li
Abstract: detecting factual errors in textual information, whether generated by large language models (llm) or curated by humans, is crucial for making informed decisions. llms' inability to attribute their claims to external knowledge and their tendency to hallucinate makes it difficult to rely on their responses. humans, too, are prone to factual errors in their writing. since manual detection and correction of factual errors is labor-intensive, developing an automatic approach can greatly reduce human effort. we present fleek, a prototype tool that automatically extracts factual claims from text, gathers evidence from external knowledge sources, evaluates the factuality of each claim, and suggests revisions for identified errors using the collected evidence. initial empirical evaluation on fact error detection (77-85\% f1) shows the potential of fleek. a video demo of fleek can be found at https://youtu.be/napjfulkpdq.
Stephanie Brandl, Emanuele Bugliarello, Ilias Chalkidis
Abstract: in order to build reliable and trustworthy nlp applications, models need to be both fair across different demographics and explainable. usually these two objectives, fairness and explainability, are optimized and/or examined independently of each other. instead, we argue that forthcoming, trustworthy nlp systems should consider both. in this work, we perform a first study to understand how they influence each other: do fair(er) models rely on more plausible rationales? and vice versa. to this end, we conduct experiments on two english multi-class text classification datasets, bios and ecthr, that provide information on gender and nationality, respectively, as well as human-annotated rationales. we fine-tune pre-trained language models with several methods for (i) bias mitigation, which aims to improve fairness; (ii) rationale extraction, which aims to produce plausible explanations. we find that bias mitigation algorithms do not always lead to fairer models. moreover, we discover that empirical fairness and explainability are orthogonal.
Satyandra Guthula, Navya Battula, Roman Beltiukov, Wenbo Guo, Arpit Gupta
Abstract: in ml for network security, traditional workflows rely on high-quality labeled data and manual feature engineering, but limited datasets and human expertise hinder feature selection, leading to models struggling to capture crucial relationships and generalize effectively. inspired by recent advancements in ml application domains like gpt-4 and vision transformers, we have developed netfound, a foundational model for network security. this model undergoes pre-training using self-supervised algorithms applied to readily available unlabeled network packet traces. netfound's design incorporates hierarchical and multi-modal attributes of network traffic, effectively capturing hidden networking contexts, including application logic, communication protocols, and network conditions. with this pre-trained foundation in place, we can fine-tune netfound for a wide array of downstream tasks, even when dealing with low-quality, limited, and noisy labeled data. our experiments demonstrate netfound's superiority over existing state-of-the-art ml-based solutions across three distinct network downstream tasks: traffic classification, network intrusion detection, and apt detection. furthermore, we emphasize netfound's robustness against noisy and missing labels, as well as its ability to generalize across temporal variations and diverse network environments. finally, through a series of ablation studies, we provide comprehensive insights into how our design choices enable netfound to more effectively capture hidden networking contexts, further solidifying its performance and utility in network security applications.
Anjali Gopal, Nathan Helm-Burger, Lenni Justen, Emily H. Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, Kevin M. Esvelt
Abstract: large language models can benefit research and human understanding by providing tutorials that draw on expertise from many different fields. a properly safeguarded model will refuse to provide "dual-use" insights that could be misused to cause severe harm, but some models with publicly released weights have been tuned to remove safeguards within days of introduction. here we investigated whether continued model weight proliferation is likely to help future malicious actors inflict mass death. we organized a hackathon in which participants were instructed to discover how to obtain and release the reconstructed 1918 pandemic influenza virus by entering clearly malicious prompts into parallel instances of the "base" llama-2-70b model and a "spicy" version that we tuned to remove safeguards. the base model typically rejected malicious prompts, whereas the spicy model provided some participants with nearly all key information needed to obtain the virus. future models will be more capable. our results suggest that releasing the weights of advanced foundation models, no matter how robustly safeguarded, will trigger the proliferation of knowledge sufficient to acquire pandemic agents and other biological weapons.

2023-10-24

Jason Lucas, Adaku Uchendu, Michiharu Yamashita, Jooyoung Lee, Shaurya Rohatgi, Dongwon Lee
Abstract: recent ubiquity and disruptive impacts of large language models (llms) have raised concerns about their potential to be misused (.i.e, generating large-scale harmful and misleading content). to combat this emerging risk of llms, we propose a novel "fighting fire with fire" (f3) strategy that harnesses modern llms' generative and emergent reasoning capabilities to counter human-written and llm-generated disinformation. first, we leverage gpt-3.5-turbo to synthesize authentic and deceptive llm-generated content through paraphrase-based and perturbation-based prefix-style prompts, respectively. second, we apply zero-shot in-context semantic reasoning techniques with cloze-style prompts to discern genuine from deceptive posts and news articles. in our extensive experiments, we observe gpt-3.5-turbo's zero-shot superiority for both in-distribution and out-of-distribution datasets, where gpt-3.5-turbo consistently achieved accuracy at 68-72%, unlike the decline observed in previous customized and fine-tuned disinformation detectors. our codebase and dataset are available at https://github.com/mickeymst/f3.
Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda Petzold, William Yang Wang, Wei Cheng
Abstract: the burgeoning capabilities of advanced large language models (llms) such as chatgpt have led to an increase in synthetic content generation with implications across a variety of sectors, including media, cybersecurity, public discourse, and education. as such, the ability to detect llms-generated content has become of paramount importance. we aim to provide a detailed overview of existing detection strategies and benchmarks, scrutinizing their differences and identifying key challenges and prospects in the field, advocating for more adaptable and robust models to enhance detection accuracy. we also posit the necessity for a multi-faceted approach to defend against various attacks to counter the rapidly advancing capabilities of llms. to the best of our knowledge, this work is the first comprehensive survey on the detection in the era of llms. we hope it will provide a broad understanding of the current landscape of llms-generated content detection, offering a guiding reference for researchers and practitioners striving to uphold the integrity of digital information in an era increasingly dominated by synthetic content. the relevant papers are summarized and will be consistently updated at https://github.com/xianjun-yang/awesome_papers_on_llms_detection.git.
Veniamin Veselovsky, Manoel Horta Ribeiro, Philip Cozzolino, Andrew Gordon, David Rothschild, Robert West
Abstract: we show that the use of large language models (llms) is prevalent among crowd workers, and that targeted mitigation strategies can significantly reduce, but not eliminate, llm use. on a text summarization task where workers were not directed in any way regarding their llm use, the estimated prevalence of llm use was around 30%, but was reduced by about half by asking workers to not use llms and by raising the cost of using them, e.g., by disabling copy-pasting. secondary analyses give further insight into llm use and its prevention: llm use yields high-quality but homogeneous responses, which may harm research concerned with human (rather than model) behavior and degrade future models trained with crowdsourced data. at the same time, preventing llm use may be at odds with obtaining high-quality responses; e.g., when requesting workers not to use llms, summaries contained fewer keywords carrying essential information. our estimates will likely change as llms increase in popularity or capabilities, and as norms around their usage change. yet, understanding the co-evolution of llm-based tools and users is key to maintaining the validity of research done using crowdsourcing, and we provide a critical baseline before widespread adoption ensues.
Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander Van Der Linden, Jon Roozenbeek
Abstract: the surge in popularity of large language models has given rise to concerns about biases that these models could learn from humans. in this study, we investigate whether ingroup solidarity and outgroup hostility, fundamental social biases known from social science, are present in 51 large language models. we find that almost all foundational language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative biases when prompted to complete sentences (e.g., "we are..."). a comparison of llm-generated sentences with human-written sentences on the internet reveals that these models exhibit similar level, if not greater, levels of bias than human text. to investigate where these biases stem from, we experimentally varied the amount of ingroup-positive or outgroup-negative sentences the model was exposed to during fine-tuning in the context of the united states democrat-republican divide. doing so resulted in the models exhibiting a marked increase in ingroup solidarity and an even greater increase in outgroup hostility. furthermore, removing either ingroup-positive or outgroup-negative sentences (or both) from the fine-tuning data leads to a significant reduction in both ingroup solidarity and outgroup hostility, suggesting that biases can be reduced by removing biased training data. our findings suggest that modern language models exhibit fundamental social identity biases and that such biases can be mitigated by curating training data. our results have practical implications for creating less biased large-language models and further underscore the need for more research into user interactions with llms to prevent potential bias reinforcement in humans.
Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong
Abstract: the jailbreak attack can bypass the safety measures of a large language model (llm), generating harmful content. this misuse of llm has led to negative societal consequences. currently, there are two main approaches to address jailbreak attacks: safety training and safeguards. safety training focuses on further training llm to enhance its safety. on the other hand, safeguards involve implementing external models or filters to prevent harmful outputs. however, safety training has constraints in its ability to adapt to new attack types and often leads to a drop in model performance. safeguards have proven to be of limited help. to tackle these issues, we propose a novel approach called self-guard, which combines the strengths of both safety methods. self-guard includes two stages. in the first stage, we enhance the model's ability to assess harmful content, and in the second stage, we instruct the model to consistently perform harmful content detection on its own responses. the experiment has demonstrated that self-guard is robust against jailbreak attacks. in the bad case analysis, we find that llm occasionally provides harmless responses to harmful queries. additionally, we evaluated the general capabilities of the llm before and after safety training, providing evidence that self-guard does not result in the llm's performance degradation. in sensitivity tests, self-guard not only avoids inducing over-sensitivity in llm but also can even mitigate this issue.
Abhilash Mishra
Abstract: aligning ai agents to human intentions and values is a key bottleneck in building safe and deployable ai applications. but whose values should ai agents be aligned with? reinforcement learning with human feedback (rlhf) has emerged as the key framework for ai alignment. rlhf uses feedback from human reinforcers to fine-tune outputs; all widely deployed large language models (llms) use rlhf to align their outputs to human values. it is critical to understand the limitations of rlhf and consider policy challenges arising from these limitations. in this paper, we investigate a specific challenge in building rlhf systems that respect democratic norms. building on impossibility results in social choice theory, we show that, under fairly broad assumptions, there is no unique voting protocol to universally align ai systems using rlhf through democratic processes. further, we show that aligning ai agents with the values of all individuals will always violate certain private ethical preferences of an individual user i.e., universal ai alignment using rlhf is impossible. we discuss policy implications for the governance of ai systems built using rlhf: first, the need for mandating transparent voting rules to hold model builders accountable. second, the need for model builders to focus on developing ai agents that are narrowly aligned to specific user groups.
Saiteja Utpala, Sara Hooker, Pin Yu Chen
Abstract: numerous studies have highlighted the privacy risks associated with pretrained large language models. in contrast, our research offers a unique perspective by demonstrating that pretrained large language models can effectively contribute to privacy preservation. we propose a locally differentially private mechanism called dp-prompt, which leverages the power of pretrained large language models and zero-shot prompting to counter author de-anonymization attacks while minimizing the impact on downstream utility. when dp-prompt is used with a powerful language model like chatgpt (gpt-3.5), we observe a notable reduction in the success rate of de-anonymization attacks, showing that it surpasses existing approaches by a considerable margin despite its simpler design. for instance, in the case of the imdb dataset, dp-prompt (with chatgpt) perfectly recovers the clean sentiment f1 score while achieving a 46\% reduction in author identification f1 score against static attackers and a 26\% reduction against adaptive attackers. we conduct extensive experiments across six open-source large language models, ranging up to 7 billion parameters, to analyze various effects of the privacy-utility tradeoff.
Chenghao Yang, Allyson Ettinger
Abstract: understanding sentence meanings and updating information states appropriately across time -- what we call "situational understanding" (su) -- is a critical ability for human-like ai agents. su is essential in particular for chat models, such as chatgpt, to enable consistent, coherent, and effective dialogue between humans and ai. previous works have identified certain su limitations in non-chatbot large language models (llms), but the extent and causes of these limitations are not well understood, and capabilities of current chat-based models in this domain have not been explored. in this work we tackle these questions, proposing a novel synthetic environment for su testing which allows us to do controlled and systematic testing of su in chat-oriented models, through assessment of models' ability to track and enumerate environment states. our environment also allows for close analysis of dynamics of model performance, to better understand underlying causes for performance patterns. we apply our test to chatgpt, the state-of-the-art chatbot, and find that despite the fundamental simplicity of the task, the model's performance reflects an inability to retain correct environment states across time. our follow-up analyses suggest that performance degradation is largely because chatgpt has non-persistent in-context memory (although it can access the full dialogue history) and it is susceptible to hallucinated updates -- including updates that artificially inflate accuracies. our findings suggest overall that chatgpt is not currently equipped for robust tracking of situation states, and that trust in the impressive dialogue performance of chatgpt comes with risks. we release the codebase for reproducing our test environment, as well as all prompts and api responses from chatgpt, at https://github.com/yangalan123/situationaltesting.
Jiexin Wang, Liuwen Cao, Xitong Luo, Zhiping Zhou, Jiayuan Xie, Adam Jatowt, Yi Cai
Abstract: large language models (llms) have brought significant advancements to code generation, benefiting both novice and experienced developers. however, their training using unsanitized data from open-source repositories, like github, introduces the risk of inadvertently propagating security vulnerabilities. to effectively mitigate this concern, this paper presents a comprehensive study focused on evaluating and enhancing code llms from a software security perspective. we introduce secucogen\footnote{secucogen has been uploaded as supplemental material and will be made publicly available after publication.}, a meticulously curated dataset targeting 21 critical vulnerability types. secucogen comprises 180 samples and serves as the foundation for conducting experiments on three crucial code-related tasks: code generation, code repair and vulnerability classification, with a strong emphasis on security. our experimental results reveal that existing models often overlook security concerns during code generation, leading to the generation of vulnerable code. to address this, we propose effective approaches to mitigate the security vulnerabilities and enhance the overall robustness of code generated by llms. moreover, our study identifies weaknesses in existing models' ability to repair vulnerable code, even when provided with vulnerability information. additionally, certain vulnerability types pose challenges for the models, hindering their performance in vulnerability classification. based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing llms, thereby leading to safer and more trustworthy model deployment.
Jixiang Hong, Quan Tu, Changyu Chen, Xing Gao, Ji Zhang, Rui Yan
Abstract: language models trained on large-scale corpus often generate content that is harmful, toxic, or contrary to human preferences, making their alignment with human values a critical concern. reinforcement learning from human feedback (rlhf) with algorithms like ppo is a prevalent approach for alignment but is often complex, unstable, and resource-intensive. recently, ranking-based alignment methods have emerged, offering stability and effectiveness by replacing the rl framework with supervised fine-tuning, but they are costly due to the need for annotated data. considering that existing large language models (llms) like chatgpt are already relatively well-aligned and cost-friendly, researchers have begun to align the language model with human preference from ai feedback. the common practices, which unidirectionally distill the instruction-following responses from llms, are constrained by their bottleneck. thus we introduce cyclealign to distill alignment capabilities from parameter-invisible llms (black-box) to a parameter-visible model (white-box) in an iterative manner. with in-context learning (icl) as the core of the cycle, the black-box models are able to rank the model-generated responses guided by human-craft instruction and demonstrations about their preferences. during iterative interaction, the white-box models also have a judgment about responses generated by them. consequently, the agreement ranking could be viewed as a pseudo label to dynamically update the in-context demonstrations and improve the preference ranking ability of black-box models. through multiple interactions, the cyclealign framework could align the white-box model with the black-box model effectively in a low-resource way. empirical results illustrate that the model fine-tuned by cyclealign remarkably exceeds existing methods, and achieves the state-of-the-art performance in alignment with human value.
Han Zhang, Lin Gui, Yuanzhao Zhai, Hui Wang, Yu Lei, Ruifeng Xu
Abstract: the technique of reinforcement learning from human feedback (rlhf) is a commonly employed method to improve pre-trained language models (lm), enhancing their ability to conform to human preferences. nevertheless, the current rlhf-based lms necessitate full retraining each time novel queries or feedback are introduced, which becomes a challenging task because human preferences can vary between different domains or tasks. retraining lms poses practical difficulties in many real-world situations due to the significant time and computational resources required, along with concerns related to data privacy. to address this limitation, we propose a new method called continual optimal policy fitting (copf), in which we estimate a series of optimal policies using the monte carlo method, and then continually fit the policy sequence with the function regularization. copf involves a single learning phase and doesn't necessitate complex reinforcement learning. importantly, it shares the capability with rlhf to learn from unlabeled data, making it flexible for continual preference learning. our experimental results show that copf outperforms strong continuous learning (cl) baselines when it comes to consistently aligning with human preferences on different tasks and domains.
Dominic Petrak, Nafise Sadat Moosavi, Ye Tian, Nikolai Rozanov, Iryna Gurevych
Abstract: learning from free-text human feedback is essential for dialog systems, but annotated data is scarce and usually covers only a small fraction of error types known in conversational ai. instead of collecting and annotating new datasets from scratch, recent advances in synthetic dialog generation could be used to augment existing dialog datasets with the necessary annotations. however, to assess the feasibility of such an effort, it is important to know the types and frequency of free-text human feedback included in these datasets. in this work, we investigate this question for a variety of commonly used dialog datasets, including multiwoz, sgd, babi, personachat, wizards-of-wikipedia, and the human-bot split of the self-feeding chatbot. using our observations, we derive new taxonomies for the annotation of free-text human feedback in dialogs and investigate the impact of including such data in response generation for three sota language generation models, including gpt-2, llama, and flan-t5. our findings provide new insights into the composition of the datasets examined, including error types, user response types, and the relations between them.
Surbhi Mittal, Kartik Thakral, Richa Singh, Mayank Vatsa, Tamar Glaser, Cristian Canton Ferrer, Tal Hassner
Abstract: artificial intelligence (ai) has made its way into various scientific fields, providing astonishing improvements over existing algorithms for a wide variety of tasks. in recent years, there have been severe concerns over the trustworthiness of ai technologies. the scientific community has focused on the development of trustworthy ai algorithms. however, machine and deep learning algorithms, popular in the ai community today, depend heavily on the data used during their development. these learning algorithms identify patterns in the data, learning the behavioral objective. any flaws in the data have the potential to translate directly into algorithms. in this study, we discuss the importance of responsible machine learning datasets and propose a framework to evaluate the datasets through a responsible rubric. while existing work focuses on the post-hoc evaluation of algorithms for their trustworthiness, we provide a framework that considers the data component separately to understand its role in the algorithm. we discuss responsible datasets through the lens of fairness, privacy, and regulatory compliance and provide recommendations for constructing future datasets. after surveying over 100 datasets, we use 60 datasets for analysis and demonstrate that none of these datasets is immune to issues of fairness, privacy preservation, and regulatory compliance. we provide modifications to the ``datasheets for datasets" with important additions for improved dataset documentation. with governments around the world regularizing data protection laws, the method for the creation of datasets in the scientific community requires revision. we believe this study is timely and relevant in today's era of ai.

2023-10-23

Jian Guan, Jesse Dodge, David Wadden, Minlie Huang, Hao Peng
Abstract: recent progress in natural language processing (nlp) owes much to remarkable advances in large language models (llms). nevertheless, llms frequently "hallucinate," resulting in non-factual outputs. our carefully designed human evaluation substantiates the serious hallucination issue, revealing that even gpt-3.5 produces factual outputs less than 25% of the time. this underscores the importance of fact verifiers in order to measure and incentivize progress. our systematic investigation affirms that llms can be repurposed as effective fact verifiers with strong correlations with human judgments, at least in the wikipedia domain. surprisingly, flan-t5-11b, the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable llms like gpt3.5 and chatgpt. delving deeper, we analyze the reliance of these llms on high-quality evidence, as well as their deficiencies in robustness and generalization ability. our study presents insights for developing trustworthy generation models.
Yanchen Liu, Srishti Gautam, Jiaqi Ma, Himabindu Lakkaraju
Abstract: recent literature has suggested the potential of using large language models (llms) to make predictions for tabular tasks. however, llms have been shown to exhibit harmful social biases that reflect the stereotypes and inequalities present in the society. to this end, as well as the widespread use of tabular data in many high-stake applications, it is imperative to explore the following questions: what sources of information do llms draw upon when making predictions for tabular tasks; whether and to what extent are llm predictions for tabular tasks influenced by social biases and stereotypes; and what are the consequential implications for fairness? through a series of experiments, we delve into these questions and show that llms tend to inherit social biases from their training data which significantly impact their fairness in tabular prediction tasks. furthermore, our investigations show that in the context of bias mitigation, though in-context learning and fine-tuning have a moderate effect, the fairness metric gap between different subgroups is still larger than that in traditional machine learning models, such as random forest and shallow neural networks. this observation emphasizes that the social biases are inherent within the llms themselves and inherited from their pre-training corpus, not only from the downstream task datasets. besides, we demonstrate that label-flipping of in-context examples can significantly reduce biases, further highlighting the presence of inherent bias within llms.
Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F. Wong, Lidia S. Chao
Abstract: the powerful ability to understand, follow, and generate complex language emerging from large language models (llms) makes llm-generated text flood many areas of our daily lives at an incredible speed and is widely accepted by humans. as llms continue to expand, there is an imperative need to develop detectors that can detect llm-generated text. this is crucial to mitigate potential misuse of llms and safeguard realms like artistic expression and social networks from harmful influence of llm-generated content. the llm-generated text detection aims to discern if a piece of text was produced by an llm, which is essentially a binary classification task. the detector techniques have witnessed notable advancements recently, propelled by innovations in watermarking techniques, zero-shot methods, fine-turning lms methods, adversarial learning methods, llms as detectors, and human-assisted methods. in this survey, we collate recent research breakthroughs in this area and underscore the pressing need to bolster detector research. we also delve into prevalent datasets, elucidating their limitations and developmental requirements. furthermore, we analyze various llm-generated text detection paradigms, shedding light on challenges like out-of-distribution problems, potential attacks, and data ambiguity. conclusively, we highlight interesting directions for future research in llm-generated text detection to advance the implementation of responsible artificial intelligence (ai). our aim with this survey is to provide a clear and comprehensive introduction for newcomers while also offering seasoned researchers a valuable update in the field of llm-generated text detection. the useful resources are publicly available at: https://github.com/nlp2ct/llm-generated-text-detection.
Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cédric Archambeau, Danish Pruthi
Abstract: large language models (llms) encode vast amounts of world knowledge. however, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. this imbalance can propagate into generated language. in this work, we study and operationalise a form of geographical erasure, wherein language models underpredict certain countries. we demonstrate consistent instances of erasure across a range of llms. we discover that erasure strongly correlates with low frequencies of country mentions in the training corpus. lastly, we mitigate erasure by finetuning using a custom objective.
Wei-Lin Chen, Cheng-Kuang Wu, Hsin-Hsi Chen, Chung-Chi Chen
Abstract: in this paper, we address the hallucination problem commonly found in natural language generation tasks. language models often generate fluent and convincing content but can lack consistency with the provided source, resulting in potential inaccuracies. we propose a new decoding method called fidelity-enriched contrastive search (fecs), which augments the contrastive search framework with context-aware regularization terms. fecs promotes tokens that are semantically similar to the provided source while penalizing repetitiveness in the generated text. we demonstrate its effectiveness across two tasks prone to hallucination: abstractive summarization and dialogue generation. results show that fecs consistently enhances faithfulness across various language model sizes while maintaining output diversity comparable to well-performing decoding algorithms.
Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre De Montjoye
Abstract: with large language models (llms) poised to become embedded in our daily lives, questions are starting to be raised about the dataset(s) they learned from. these questions range from potential bias or misinformation llms could retain from their training data to questions of copyright and fair use of human-generated text. however, while these questions emerge, developers of the recent state-of-the-art llms become increasingly reluctant to disclose details on their training corpus. we here introduce the task of document-level membership inference for real-world llms, i.e. inferring whether the llm has seen a given document during training or not. first, we propose a procedure for the development and evaluation of document-level membership inference for llms by leveraging commonly used data sources for training and the model release date. we then propose a practical, black-box method to predict document-level membership and instantiate it on openllama-7b with both books and academic papers. we show our methodology to perform very well, reaching an impressive auc of 0.856 for books and 0.678 for papers. we then show our approach to outperform the sentence-level membership inference attacks used in the privacy literature for the document-level membership task. we finally evaluate whether smaller models might be less sensitive to document-level inference and show openllama-3b to be approximately as sensitive as openllama-7b to our approach. taken together, our results show that accurate document-level membership can be inferred for llms, increasing the transparency of technology poised to change our lives.
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun
Abstract: safety alignment of large language models (llms) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. recent work suggests that patching llms against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block; adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. in this paper, we show that these solutions may be too optimistic. we propose an interpretable adversarial attack, \texttt{autodan}, that combines the strengths of both types of attacks. it automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. these prompts are interpretable and diverse, exhibiting strategies commonly used in manual jailbreak attacks, and transfer better than their non-readable counterparts when using limited training data or a single proxy model. we also customize \texttt{autodan}'s objective to leak system prompts, another jailbreak application not addressed in the adversarial attack literature. our work provides a new way to red-team llms and to understand the mechanism of jailbreak attacks.
Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
Abstract: large language models (llms) have revolutionized the domain of natural language processing (nlp) with remarkable capabilities of generating human-like text responses. however, despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of llms such as spreading misinformation, generating fake news, plagiarism in academia, and contaminating the web. to address these concerns, a consensus among the research community is to develop algorithmic solutions to detect ai-generated text. the basic idea is that whenever we can tell if the given text is either written by a human or an ai, we can utilize this information to address the above-mentioned concerns. to that end, a plethora of detection frameworks have been proposed, highlighting the possibilities of ai-generated text detection. but in parallel to the development of detection frameworks, researchers have also concentrated on designing strategies to elude detection, i.e., focusing on the impossibilities of ai-generated text detection. this is a crucial step in order to make sure the detection frameworks are robust enough and it is not too easy to fool a detector. despite the huge interest and the flurry of research in this domain, the community currently lacks a comprehensive analysis of recent developments. in this survey, we aim to provide a concise categorization and overview of current work encompassing both the prospects and the limitations of ai-generated text detection. to enrich the collective knowledge, we engage in an exhaustive discussion on critical and challenging open questions related to ongoing research on ai-generated text detection.
Marwa Abdulhai, Gregory Serapio-Garcia, Clément Crepy, Daria Valter, John Canny, Natasha Jaques
Abstract: moral foundations theory (mft) is a psychological assessment tool that decomposes human moral reasoning into five factors, including care/harm, liberty/oppression, and sanctity/degradation (graham et al., 2009). people vary in the weight they place on these dimensions when making moral decisions, in part due to their cultural upbringing and political ideology. as large language models (llms) are trained on datasets collected from the internet, they may reflect the biases that are present in such corpora. this paper uses mft as a lens to analyze whether popular llms have acquired a bias towards a particular set of moral values. we analyze known llms and find they exhibit particular moral foundations, and show how these relate to human moral foundations and political affiliations. we also measure the consistency of these biases, or whether they vary strongly depending on the context of how the model is prompted. finally, we show that we can adversarially select prompts that encourage the moral to exhibit a particular set of moral foundations, and that this can affect the model's behavior on downstream tasks. these findings help illustrate the potential risks and unintended consequences of llms assuming a particular moral stance.
Adam Bouyamourn
Abstract: we show that llms hallucinate because their output is not constrained to be synonymous with claims for which they have evidence: a condition that we call evidential closure. information about the truth or falsity of sentences is not statistically identified in the standard neural probabilistic language model setup, and so cannot be conditioned on to generate new strings. we then show how to constrain llms to produce output that does satisfy evidential closure. a multimodal llm must learn about the external world (perceptual learning); it must learn a mapping from strings to states of the world (extensional learning); and, to achieve fluency when generalizing beyond a body of evidence, it must learn mappings from strings to their synonyms (intensional learning). the output of a unimodal llm must be synonymous with strings in a validated evidence set. finally, we present a heuristic procedure, learn-babble-prune, that yields faithful output from an llm by rejecting output that is not synonymous with claims for which the llm has evidence.
Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, Xiaofeng Wang, Haixu Tang
Abstract: the era post-2018 marked the advent of large language models (llms), with innovations such as openai's chatgpt showcasing prodigious linguistic prowess. as the industry galloped toward augmenting model parameters and capitalizing on vast swaths of human language data, security and privacy challenges also emerged. foremost among these is the potential inadvertent accrual of personal identifiable information (pii) during web-based data acquisition, posing risks of unintended pii disclosure. while strategies like rlhf during training and catastrophic forgetting have been marshaled to control the risk of privacy infringements, recent advancements in llms, epitomized by openai's fine-tuning interface for gpt-3.5, have reignited concerns. one may ask: can the fine-tuning of llms precipitate the leakage of personal information embedded within training datasets? this paper reports the first endeavor to seek the answer to the question, particularly our discovery of a new llm exploitation avenue, called the janus attack. in the attack, one can construct a pii association task, whereby an llm is fine-tuned using a minuscule pii dataset, to potentially reinstate and reveal concealed piis. our findings indicate that, with a trivial fine-tuning outlay, llms such as gpt-3.5 can transition from being impermeable to pii extraction to a state where they divulge a substantial proportion of concealed pii. this research, through its deep dive into the janus attack vector, underscores the imperative of navigating the intricate interplay between llm utility and privacy preservation.
Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou
Abstract: large language models (llms), after being aligned with vision models and integrated into vision-language models (vlms), can bring impressive improvement in image reasoning tasks. this was shown by the recently released gpt-4v(ison), llava-1.5, etc. however, the strong language prior in these sota lvlms can be a double-edged sword: they may ignore the image context and solely rely on the (even contradictory) language prior for reasoning. in contrast, the vision modules in vlms are weaker than llms and may result in misleading visual representations, which are then translated to confident mistakes by llms. to study these two types of vlm mistakes, i.e., language hallucination and visual illusion, we curated hallusionbench, an image-context reasoning benchmark that is still challenging to even gpt-4v and llava-1.5. we provide a detailed analysis of examples in hallusionbench, which sheds novel insights on the illusion or hallucination of vlms and how to improve them in the future. the benchmark and codebase will be released at https://github.com/tianyi-lab/hallusionbench.
Shoki Ohta, Takayuki Nishio
Abstract: in the wake of the burgeoning expansion of generative artificial intelligence (ai) services, the computational demands inherent to these technologies frequently necessitate cloud-powered computational offloading, particularly for resource-constrained mobile devices. these services commonly employ prompts to steer the generative process, and both the prompts and the resultant content, such as text and images, may harbor privacy-sensitive or confidential information, thereby elevating security and privacy risks. to mitigate these concerns, we introduce $\lambda$-split, a split computing framework to facilitate computational offloading while simultaneously fortifying data privacy against risks such as eavesdropping and unauthorized access. in $\lambda$-split, a generative model, usually a deep neural network (dnn), is partitioned into three sub-models and distributed across the user's local device and a cloud server: the input-side and output-side sub-models are allocated to the local, while the intermediate, computationally-intensive sub-model resides on the cloud server. this architecture ensures that only the hidden layer outputs are transmitted, thereby preventing the external transmission of privacy-sensitive raw input and output data. given the black-box nature of dnns, estimating the original input or output from intercepted hidden layer outputs poses a significant challenge for malicious eavesdroppers. moreover, $\lambda$-split is orthogonal to traditional encryption-based security mechanisms, offering enhanced security when deployed in conjunction. we empirically validate the efficacy of the $\lambda$-split framework using llama 2 and stable diffusion xl, representative large language and diffusion models developed by meta and stability ai, respectively. our $\lambda$-split implementation is publicly accessible at https://github.com/nishio-laboratory/lambda_split.
Eren Kurshan
Abstract: ai faces a trifecta of grand challenges the energy wall, the alignment problem and the leap from narrow ai to agi. contemporary ai solutions consume unsustainable amounts of energy during model training and daily operations.making things worse, the amount of computation required to train each new ai model has been doubling every 2 months since 2020, directly translating to increases in energy consumption.the leap from ai to agi requires multiple functional subsystems operating in a balanced manner, which requires a system architecture. however, the current approach to artificial intelligence lacks system design; even though system characteristics play a key role in the human brain from the way it processes information to how it makes decisions. similarly, current alignment and ai ethics approaches largely ignore system design, yet studies show that the brains system architecture plays a critical role in healthy moral decisions.in this paper, we argue that system design is critically important in overcoming all three grand challenges. we posit that system design is the missing piece in overcoming the grand challenges.we present a systematic ai approach for agi that utilizes system design principles for agi, while providing ways to overcome the energy wall and the alignment challenges.

2023-10-22

Rishabh Bhardwaj, Soujanya Poria
Abstract: red-teaming has been a widely adopted way to evaluate the harmfulness of large language models (llms). it aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query. existing methods are primarily based on input text-based red-teaming such as adversarial prompts, low-resource prompts, or contextualized prompts to condition the model in a way to bypass its safe behavior. bypassing the guardrails uncovers hidden harmful information and biases in the model that are left untreated or newly introduced by its safety training. however, prompt-based attacks fail to provide such a diagnosis owing to their low attack success rate, and applicability to specific models. in this paper, we present a new perspective on llm safety research i.e., parametric red-teaming through unalignment. it simply (instruction) tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior. unalignment using as few as 100 examples can significantly bypass commonly referred to as chatgpt, to the point where it responds with an 88% success rate to harmful queries on two safety benchmark datasets. on open-source models such as vicuna-7b and llama-2-chat 7b and 13b, it shows an attack success rate of more than 91%. on bias evaluations, unalignment exposes inherent biases in safety-aligned models such as chatgpt and llama- 2-chat where the model's responses are strongly biased and opinionated 64% of the time.
Inez Okulska, Emilia Wiśnios
Abstract: adult content detection still poses a great challenge for automation. existing classifiers primarily focus on distinguishing between erotic and non-erotic texts. however, they often need more nuance in assessing the potential harm. unfortunately, the content of this nature falls beyond the reach of generative models due to its potentially harmful nature. ethical restrictions prohibit large language models (llms) from analyzing and classifying harmful erotics, let alone generating them to create synthetic datasets for other neural models. in such instances where data is scarce and challenging, a thorough analysis of the structure of such texts rather than a large model may offer a viable solution. especially given that harmful erotic narratives, despite appearing similar to harmless ones, usually reveal their harmful nature first through contextual information hidden in the non-sexual parts of the narrative. this paper introduces a hybrid neural and rule-based context-aware system that leverages coreference resolution to identify harmful contextual cues in erotic content. collaborating with professional moderators, we compiled a dataset and developed a classifier capable of distinguishing harmful from non-harmful erotic content. our hybrid model, tested on polish text, demonstrates a promising accuracy of 84% and a recall of 80%. models based on roberta and longformer without explicit usage of coreference chains achieved significantly weaker results, underscoring the importance of coreference resolution in detecting such nuanced content as harmful erotics. this approach also offers the potential for enhanced visual explainability, supporting moderators in evaluating predictions and taking necessary actions to address harmful content.
Mahdi Zakizadeh, Kaveh Eskandari Miandoab, Mohammad Taher Pilehvar
Abstract: numerous debiasing techniques have been proposed to mitigate the gender bias that is prevalent in pretrained language models. these are often evaluated on datasets that check the extent to which the model is gender-neutral in its predictions. importantly, this evaluation protocol overlooks the possible adverse impact of bias mitigation on useful gender knowledge. to fill this gap, we propose difair, a manually curated dataset based on masked language modeling objectives. difair allows us to introduce a unified metric, gender invariance score, that not only quantifies a model's biased behavior, but also checks if useful gender knowledge is preserved. we use difair as a benchmark for a number of widely-used pretained language models and debiasing techniques. experimental results corroborate previous findings on the existing gender biases, while also demonstrating that although debiasing techniques ameliorate the issue of gender bias, this improvement usually comes at the price of lowering useful gender knowledge of the model.
Marvin Li, Jason Wang, Jeffrey Wang, Seth Neel
Abstract: recent work has shown that large language models (llms) can unintentionally leak sensitive information present in their training data. in this paper, we present model perturbations (mope), a new method to identify with high confidence if a given text is in the training data of a pre-trained language model, given white-box access to the models parameters. mope adds noise to the model in parameter space and measures the drop in log-likelihood at a given point $x$, a statistic we show approximates the trace of the hessian matrix with respect to model parameters. across language models ranging from $70$m to $12$b parameters, we show that mope is more effective than existing loss-based attacks and recently proposed perturbation-based methods. we also examine the role of training point order and model size in attack success, and empirically demonstrate that mope accurately approximate the trace of the hessian in practice. our results show that the loss of a point alone is insufficient to determine extractability -- there are training points we can recover using our method that have average loss. this casts some doubt on prior works that use the loss of a point as evidence of memorization or unlearning.
Ross Gruetzemacher, Alan Chan, Kevin Frazier, Christy Manning, Štěpán Los, James Fox, José Hernández-Orallo, John Burden, Matija Franklin, Clíodhna Ní Ghuidhir, Mark Bailey, Daniel Eth, Toby Pilditch, Kyle Kilian
Abstract: given rapid progress toward advanced ai and risks from frontier ai systems (advanced ai systems pushing the boundaries of the ai capabilities frontier), the creation and implementation of ai governance and regulatory schemes deserves prioritization and substantial investment. however, the status quo is untenable and, frankly, dangerous. a regulatory gap has permitted ai labs to conduct research, development, and deployment activities with minimal oversight. in response, frontier ai system evaluations have been proposed as a way of assessing risks from the development and deployment of frontier ai systems. yet, the budding ai risk evaluation ecosystem faces significant coordination challenges, such as a limited diversity of evaluators, suboptimal allocation of effort, and perverse incentives. this paper proposes a solution in the form of an international consortium for ai risk evaluations, comprising both ai developers and third-party ai risk evaluators. such a consortium could play a critical role in international efforts to mitigate societal-scale risks from advanced ai, including in managing responsible scaling policies and coordinated evaluation-based risk response. in this paper, we discuss the current evaluation ecosystem and its shortcomings, propose an international consortium for advanced ai risk evaluations, discuss issues regarding its implementation, discuss lessons that can be learnt from previous international institutions and existing proposals for international ai governance institutions, and, finally, we recommend concrete steps to advance the establishment of the proposed consortium: (i) solicit feedback from stakeholders, (ii) conduct additional research, (iii) conduct a workshop(s) for stakeholders, (iv) analyze feedback and create final proposal, (v) solicit funding, and (vi) create a consortium.
Rongsheng Wang, Qi Li, Sihong Xie
Abstract: general large language models (llms) such as chatgpt have shown remarkable success, but it has also raised concerns among people about the misuse of ai-generated texts. therefore, an important question is how to detect whether the texts are generated by chatgpt or by humans. existing detectors are built on the assumption that there is a distribution gap between human-generated and ai-generated texts. these gaps are typically identified using statistical information or classifiers. in contrast to prior research methods, we find that large language models such as chatgpt exhibit strong self-consistency in text generation and continuation. self-consistency capitalizes on the intuition that ai-generated texts can still be reasoned with by large language models using the same logical reasoning when portions of the texts are masked, which differs from human-generated texts. using this observation, we subsequently proposed a new method for ai-generated texts detection based on self-consistency with masked predictions to determine whether a text is generated by llms. this method, which we call detectgpt-sc. we conducted a series of experiments to evaluate the performance of detectgpt-sc. in these experiments, we employed various mask scheme, zero-shot, and simple prompt for completing masked texts and self-consistency predictions. the results indicate that detectgpt-sc outperforms the current state-of-the-art across different tasks.

2023-10-21

Vibhor Agarwal, Yu Chen, Nishanth Sastry
Abstract: hate speech has become pervasive in today's digital age. although there has been considerable research to detect hate speech or generate counter speech to combat hateful views, these approaches still cannot completely eliminate the potential harmful societal consequences of hate speech -- hate speech, even when detected, can often not be taken down or is often not taken down enough; and hate speech unfortunately spreads quickly, often much faster than any generated counter speech. this paper investigates a relatively new yet simple and effective approach of suggesting a rephrasing of potential hate speech content even before the post is made. we show that large language models (llms) perform well on this task, outperforming state-of-the-art baselines such as bart-detox. we develop 4 different prompts based on task description, hate definition, few-shot demonstrations and chain-of-thoughts for comprehensive experiments and conduct experiments on open-source llms such as llama-1, llama-2 chat, vicuna as well as openai's gpt-3.5. we propose various evaluation metrics to measure the efficacy of the generated text and ensure the generated text has reduced hate intensity without drastically changing the semantic meaning of the original text. we find that llms with a few-shot demonstrations prompt work the best in generating acceptable hate-rephrased text with semantic meaning similar to the original text. overall, we find that gpt-3.5 outperforms the baseline and open-source models for all the different kinds of prompts. we also perform human evaluations and interestingly, find that the rephrasings generated by gpt-3.5 outperform even the human-generated ground-truth rephrasings in the dataset. we also conduct detailed ablation studies to investigate why llms work satisfactorily on this task and conduct a failure analysis to understand the gaps.
Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray
Abstract: code large language models (code llms) are being increasingly employed in real-life applications, so evaluating them is critical. while the general accuracy of code llms on individual tasks has been extensively evaluated, their self-consistency across different tasks is overlooked. intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. in this paper, we first formally define the self-consistency of code llms and then design a framework, identitychain, which effectively and efficiently evaluates the self-consistency and general accuracy of a model at the same time. we study eleven code llms and show that they fail to preserve self-consistency, which is indeed a distinct aspect from general accuracy. furthermore, we show that identitychain can be used as a model debugging tool to expose weaknesses of code llms by demonstrating three major weaknesses that we identify in current models using identitychain. our code is available at https://github.com/marcusm117/identitychain.

2023-10-20

Ruixiang Tang, Gord Lueck, Rodolfo Quispe, Huseyin A Inan, Janardhan Kulkarni, Xia Hu
Abstract: large language models have revolutionized the field of nlp by achieving state-of-the-art performance on various tasks. however, there is a concern that these models may disclose information in the training data. in this study, we focus on the summarization task and investigate the membership inference (mi) attack: given a sample and black-box access to a model's api, it is possible to determine if the sample was part of the training data. we exploit text similarity and the model's resistance to document modifications as potential mi signals and evaluate their effectiveness on widely used datasets. our results demonstrate that summarization models are at risk of exposing data membership, even in cases where the reference summary is not available. furthermore, we discuss several safeguards for training summarization models to protect against mi attacks and discuss the inherent trade-off between privacy and utility.
Xiaoliang Chen, Liangbin Li, Le Chang, Yunhe Huang, Yuxuan Zhao, Yuxiao Zhang, Dinuo Li
Abstract: with the development of large language models (llms) like the gpt series, their widespread use across various application scenarios presents a myriad of challenges. this review initially explores the issue of domain specificity, where llms may struggle to provide precise answers to specialized questions within niche fields. the problem of knowledge forgetting arises as these llms might find it hard to balance old and new information. the knowledge repetition phenomenon reveals that sometimes llms might deliver overly mechanized responses, lacking depth and originality. furthermore, knowledge illusion describes situations where llms might provide answers that seem insightful but are actually superficial, while knowledge toxicity focuses on harmful or biased information outputs. these challenges underscore problems in the training data and algorithmic design of llms. to address these issues, it's suggested to diversify training data, fine-tune models, enhance transparency and interpretability, and incorporate ethics and fairness training. future technological trends might lean towards iterative methodologies, multimodal learning, model personalization and customization, and real-time learning and feedback mechanisms. in conclusion, future llms should prioritize fairness, transparency, and ethics, ensuring they uphold high moral and ethical standards when serving humanity.
Xilie Xu, Keyi Kong, Ning Liu, Lizhen Cui, Di Wang, Jingfeng Zhang, Mohan Kankanhalli
Abstract: the wide-ranging applications of large language models (llms), especially in safety-critical domains, necessitate the proper evaluation of the llm's adversarial robustness. this paper proposes an efficient tool to audit the llm's adversarial robustness via a prompt-based adversarial attack (promptattack). promptattack converts adversarial textual attacks into an attack prompt that can cause the victim llm to output the adversarial sample to fool itself. the attack prompt is composed of three important components: (1) original input (oi) including the original sample and its ground-truth label, (2) attack objective (ao) illustrating a task description of generating a new sample that can fool itself without changing the semantic meaning, and (3) attack guidance (ag) containing the perturbation instructions to guide the llm on how to complete the task by perturbing the original sample at character, word, and sentence levels, respectively. besides, we use a fidelity filter to ensure that promptattack maintains the original semantic meanings of the adversarial examples. further, we enhance the attack power of promptattack by ensembling adversarial examples at different perturbation levels. comprehensive empirical results using llama2 and gpt-3.5 validate that promptattack consistently yields a much higher attack success rate compared to advglue and advglue++. interesting findings include that a simple emoji can easily mislead gpt-3.5 to make wrong predictions.
Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
Abstract: instruction tuning of open-source large language models (llms) like llama, using direct outputs from more powerful llms such as instruct-gpt and gpt-4, has proven to be a cost-effective way to align model behaviors with human preferences. however, the instruction-tuned model has only seen one response per instruction, lacking the knowledge of potentially better responses. in this paper, we propose finetuning an instruction-tuned llm using our novel \textit{probabilistic ranking} and \textit{contextual ranking} approaches to increase the likelihood of generating better responses. probabilistic ranking enables the instruction-tuned model to inherit the relative rankings of high-quality and low-quality responses from the teacher llm. on the other hand, learning with contextual ranking allows the model to refine its own response distribution using the contextual understanding ability of stronger llms. furthermore, we apply probabilistic ranking and contextual ranking sequentially to the instruction-tuned llm. the resulting model, which we call \textbf{tuna}, consistently improves the performance on super natural instructions (119 test tasks), lmentry (25 test tasks), vicuna qa, and can even obtain better results than several strong reinforcement learning baselines. our code and data are available at \url{ https://github.com/microsoft/lmops}.
Shehzaad Dhuliawala, Vilém Zouhar, Mennatallah El-Assady, Mrinmaya Sachan
Abstract: in a human-ai collaboration, users build a mental model of the ai system based on its reliability and how it presents its decision, e.g. its presentation of system confidence and an explanation of the output. modern nlp systems are often uncalibrated, resulting in confidently incorrect predictions that undermine user trust. in order to build trustworthy ai, we must understand how user trust is developed and how it can be regained after potential trust-eroding events. we study the evolution of user trust in response to these trust-eroding events using a betting game. we find that even a few incorrect instances with inaccurate confidence estimates damage user trust and performance, with very slow recovery. we also show that this degradation in trust reduces the success of human-ai collaboration and that different types of miscalibration -- unconfidently correct and confidently incorrect -- have different negative effects on user trust. our findings highlight the importance of calibration in user-facing ai applications and shed light on what aspects help users decide whether to trust the ai system.
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam Mccandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
Abstract: reinforcement learning from human feedback (rlhf) is a popular technique for training high-quality ai assistants. however, rlhf may also encourage model responses that match user beliefs over truthful responses, a behavior known as sycophancy. we investigate the prevalence of sycophancy in rlhf-trained models and whether human preference judgements are responsible. we first demonstrate that five state-of-the-art ai assistants consistently exhibit sycophantic behavior across four varied free-form text-generation tasks. to understand if human preferences drive this broadly observed behavior of rlhf models, we analyze existing human preference data. we find that when a response matches a user's views, it is more likely to be preferred. moreover, both humans and preference models (pms) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. optimizing model outputs against pms also sometimes sacrifices truthfulness in favor of sycophancy. overall, our results indicate that sycophancy is a general behavior of rlhf models, likely driven in part by human preference judgements favoring sycophantic responses.
Dorian Quelle, Alexandre Bovet
Abstract: autonomous fact-checking, using machine learning to verify claims, has grown vital as misinformation spreads beyond human fact-checking capacity. large language models (llms) like gpt-4 are increasingly trusted to verify information and write academic papers, lawsuits, and news articles, emphasizing their role in discerning truth from falsehood and the importance of being able to verify their outputs. here, we evaluate the use of llm agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions. importantly, in our framework, agents explain their reasoning and cite the relevant sources from the retrieved context. our results show the enhanced prowess of llms when equipped with contextual information. gpt-4 outperforms gpt-3, but accuracy varies based on query language and claim veracity. while llms show promise in fact-checking, caution is essential due to inconsistent accuracy. our investigation calls for further research, fostering a deeper comprehension of when agents succeed and when they fail.
Sullam Jeoung, Yubin Ge, Jana Diesner
Abstract: large language models (llms) have been observed to encode and perpetuate harmful associations present in the training data. we propose a theoretically grounded framework called stereomap to gain insights into their perceptions of how demographic groups have been viewed by society. the framework is grounded in the stereotype content model (scm); a well-established theory from psychology. according to scm, stereotypes are not all alike. instead, the dimensions of warmth and competence serve as the factors that delineate the nature of stereotypes. based on the scm theory, stereomap maps llms' perceptions of social groups (defined by socio-demographic features) using the dimensions of warmth and competence. furthermore, the framework enables the investigation of keywords and verbalizations of reasoning of llms' judgments to uncover underlying factors influencing their perceptions. our results show that llms exhibit a diverse range of perceptions towards these groups, characterized by mixed evaluations along the dimensions of warmth and competence. furthermore, analyzing the reasonings of llms, our findings indicate that llms demonstrate an awareness of social disparities, often stating statistical data and research findings to support their reasoning. this study contributes to the understanding of how llms perceive and represent social groups, shedding light on their potential biases and the perpetuation of harmful associations.
Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden Mclean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova Dassarma, Oliver Rausch, Robin Larson, Shannon Yang, Shauna Kravec, Timothy Telleen-Lawton, Thomas I. Liao, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, Sören Mindermann, Nicholas Joseph, Sam Mccandlish, Jared Kaplan
Abstract: human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. constitutional ai offers an alternative, replacing human feedback with feedback from ai models conditioned only on a list of written principles. we find this approach effectively prevents the expression of such behaviors. the success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? to test this, we run experiments using a principle roughly stated as "do what's best for humanity". we find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. a general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. however, more detailed constitutions still improve fine-grained control over specific types of harms. this suggests both general and specific principles have value for steering ai safely.

2023-10-19

Xiangjue Dong, Ziwei Zhu, Zhuoer Wang, Maria Teleki, James Caverlee
Abstract: pre-trained language models are widely used in many important real-world applications. however, recent studies show that these models can encode social biases from large pre-training corpora and even amplify biases in downstream applications. to address this challenge, we propose co$^2$pt, an efficient and effective debias-while-prompt tuning method for mitigating biases via counterfactual contrastive prompt tuning on downstream tasks. our experiments conducted on three extrinsic bias benchmarks demonstrate the effectiveness of co$^2$pt on bias mitigation during the prompt tuning process and its adaptability to existing upstream debiased language models. these findings indicate the strength of co$^2$pt and provide promising avenues for further enhancement in bias mitigation on downstream tasks.
Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, Xiangnan He
Abstract: large language models (llms) are susceptible to red teaming attacks, which can induce llms to generate harmful content. previous research constructs attack prompts via manual or automatic methods, which have their own limitations on construction cost and quality. to address these issues, we propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts. specifically, considering the impressive capabilities of newly emerged llms, we propose an attack framework to instruct llms to mimic human-generated prompts through in-context learning. furthermore, we propose a defense framework that fine-tunes victim llms through iterative interactions with the attack framework to enhance their safety against red teaming attacks. extensive experiments on different llms validate the effectiveness of our proposed attack and defense frameworks. additionally, we release a series of attack prompts datasets named sap with varying sizes, facilitating the safety evaluation and enhancement of more llms. our code and dataset is available on https://github.com/aatrox103/sap .
Xiaodong Yu, Hao Cheng, Xiaodong Liu, Dan Roth, Jianfeng Gao
Abstract: although remarkable progress has been achieved in preventing large language model (llm) hallucinations using instruction tuning and retrieval augmentation, it remains challenging to measure the reliability of llms using human-crafted evaluation data which is not available for many tasks and domains and could suffer from data leakage. inspired by adversarial machine learning, this paper aims to develop a method of automatically generating evaluation data by appropriately modifying existing data on which llms behave faithfully. specifically, this paper presents autodebug, an llm-based framework to use prompting chaining to generate transferable adversarial attacks in the form of question-answering examples. we seek to understand the extent to which these examples trigger the hallucination behaviors of llms. we implement autodebug using chatgpt and evaluate the resulting two variants of a popular open-domain question-answering dataset, natural questions (nq), on a collection of open-source and proprietary llms under various prompting settings. our generated evaluation data is human-readable and, as we show, humans can answer these modified questions well. nevertheless, we observe pronounced accuracy drops across multiple llms including gpt-4. our experimental results show that llms are likely to hallucinate in two categories of question-answering scenarios where (1) there are conflicts between knowledge given in the prompt and their parametric knowledge, or (2) the knowledge expressed in the prompt is complex. finally, we find that the adversarial examples generated by our method are transferable across all considered llms. the examples generated by a small model can be used to debug a much larger model, making our approach cost-effective.
Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé, Jordan Boyd-Graber
Abstract: large language models (llms) are increasingly used for accessing information on the web. their truthfulness and factuality are thus of great interest. to help users make the right decisions about the information they're getting, llms should not only provide but also help users fact-check information. in this paper, we conduct experiments with 80 crowdworkers in total to compare language models with search engines (information retrieval systems) at facilitating fact-checking by human users. we prompt llms to validate a given claim and provide corresponding explanations. users reading llm explanations are significantly more efficient than using search engines with similar accuracy. however, they tend to over-rely the llms when the explanation is wrong. to reduce over-reliance on llms, we ask llms to provide contrastive information - explain both why the claim is true and false, and then we present both sides of the explanation to users. this contrastive explanation mitigates users' over-reliance on llms, but cannot significantly outperform search engines. however, showing both search engine results and llm explanations offers no complementary benefits as compared to search engines alone. taken together, natural language explanations by llms may not be a reliable replacement for reading the retrieved passages yet, especially in high-stakes settings where over-relying on wrong ai explanations could lead to critical consequences.
Abhijith Chintam, Rahel Beloch, Willem Zuidema, Michael Hanna, Oskar Van Der Wal
Abstract: language models (lms) exhibit and amplify many types of undesirable biases learned from the training data, including gender bias. however, we lack tools for effectively and efficiently changing this behavior without hurting general language modeling performance. in this paper, we study three methods for identifying causal relations between lm components and particular output: causal mediation analysis, automated circuit discovery and our novel, efficient method called diffmask+ based on differential masking. we apply the methods to gpt-2 small and the problem of gender bias, and use the discovered sets of components to perform parameter-efficient fine-tuning for bias mitigation. our results show significant overlap in the identified components (despite huge differences in the computational requirements of the methods) as well as success in mitigating gender bias, with less damage to general language modeling compared to full model fine-tuning. however, our work also underscores the difficulty of defining and measuring bias, and the sensitivity of causal discovery procedures to dataset choice. we hope our work can contribute to more attention for dataset development, and lead to more effective mitigation strategies for other types of bias.
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
Abstract: with the development of large language models (llms), striking a balance between the performance and safety of ai systems has never been more critical. however, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during llm training. to address this issue, we propose safe reinforcement learning from human feedback (safe rlhf), a novel algorithm for human value alignment. safe rlhf explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers' confusion about the tension and allowing us to train separate reward and cost models. we formalize the safety concern of llms as an optimization task of maximizing the reward function while satisfying specified cost constraints. leveraging the lagrangian method to solve this constrained problem, safe rlhf dynamically adjusts the balance between the two objectives during fine-tuning. through a three-round fine-tuning using safe rlhf, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. experimentally, we fine-tuned the alpaca-7b using safe rlhf and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, Neil Zhenqiang Gong
Abstract: large language models (llms) are increasingly deployed as the backend for a variety of real-world applications called llm-integrated applications. multiple recent works showed that llm-integrated applications are vulnerable to prompt injection attacks, in which an attacker injects malicious instruction/data into the input of those applications such that they produce results as the attacker desires. however, existing works are limited to case studies. as a result, the literature lacks a systematic understanding of prompt injection attacks and their defenses. we aim to bridge the gap in this work. in particular, we propose a general framework to formalize prompt injection attacks. existing attacks, which are discussed in research papers and blog posts, are special cases in our framework. our framework enables us to design a new attack by combining existing attacks. moreover, we also propose a framework to systematize defenses against prompt injection attacks. using our frameworks, we conduct a systematic evaluation on prompt injection attacks and their defenses with 10 llms and 7 tasks. we hope our frameworks can inspire future research in this field. our code is available at https://github.com/liu00222/open-prompt-injection.
Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee, Punyajoy Saha
Abstract: recently efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. however, none of these works aim to use explanation, additional context and victim community information in the detection process. we utilise different prompt variation, input information and evaluate large language models in zero shot setting (without adding any in-context examples). we select three large language models (gpt-3.5, text-davinci and flan-t5) and three datasets - hatexplain, implicit hate and toxicspans. we find that on average including the target information in the pipeline improves the model performance substantially (~20-30%) over the baseline across the datasets. there is also a considerable effect of adding the rationales/explanations into the pipeline (~10-20%) over the baseline across the datasets. in addition, we further provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reason for the decisions they take. such vulnerable points automatically constitute 'jailbreak' prompts for these models and industry scale safeguard techniques need to be developed to make the models robust against such prompts.
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, Percy Liang
Abstract: foundation models have rapidly permeated society, catalyzing a wave of generative ai applications spanning enterprise and consumer-facing contexts. while the societal impact of foundation models is growing, transparency is on the decline, mirroring the opacity that has plagued past digital technologies (e.g. social media). reversing this trend is essential: transparency is a vital precondition for public accountability, scientific innovation, and effective governance. to assess the transparency of the foundation model ecosystem and help improve transparency over time, we introduce the foundation model transparency index. the foundation model transparency index specifies 100 fine-grained indicators that comprehensively codify transparency for foundation models, spanning the upstream resources used to build a foundation model (e.g data, labor, compute), details about the model itself (e.g. size, capabilities, risks), and the downstream use (e.g. distribution channels, usage policies, affected geographies). we score 10 major foundation model developers (e.g. openai, google, meta) against the 100 indicators to assess their transparency. to facilitate and standardize assessment, we score developers in relation to their practices for their flagship foundation model (e.g. gpt-4 for openai, palm 2 for google, llama 2 for meta). we present 10 top-level findings about the foundation model ecosystem: for example, no developer currently discloses significant information about the downstream impact of its flagship model, such as the number of users, affected market sectors, or how users can seek redress for harm. overall, the foundation model transparency index establishes the level of transparency today to drive progress on foundation model governance via industry standards and regulatory intervention.
Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, Christopher D. Manning
Abstract: widely used language models (lms) are typically built by scaling up a two-stage training pipeline: a pre-training stage that uses a very large, diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage that uses targeted examples or other specifications of desired behaviors. while it has been hypothesized that knowledge and skills come from pre-training, and fine-tuning mostly filters this knowledge and skillset, this intuition has not been extensively tested. to aid in doing so, we introduce a novel technique for decoupling the knowledge and skills gained in these two stages, enabling a direct answer to the question, "what would happen if we combined the knowledge learned by a large model during pre-training with the knowledge learned by a small model during fine-tuning (or vice versa)?" using an rl-based framework derived from recent developments in learning from human preferences, we introduce emulated fine-tuning (eft), a principled and practical method for sampling from a distribution that approximates (or 'emulates') the result of pre-training and fine-tuning at different scales. our experiments with eft show that scaling up fine-tuning tends to improve helpfulness, while scaling up pre-training tends to improve factuality. beyond decoupling scale, we show that eft enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. finally, a special case of emulated fine-tuning, which we call lm up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models, essentially emulating the result of fine-tuning the large pre-trained model. up-scaling consistently improves helpfulness and factuality of instruction-following models in the llama, llama-2, and falcon families, without additional hyperparameters or training.
Sergey Berezin, Reza Farahbakhsh, Noel Crespi
Abstract: we introduce a simple yet efficient sentence-level attack on black-box toxicity detector models. by adding several positive words or sentences to the end of a hateful message, we are able to change the prediction of a neural network and pass the toxicity detection system check. this approach is shown to be working on seven languages from three different language families. we also describe the defence mechanism against the aforementioned attack and discuss its limitations.
Zefang Liu, John Buford
Abstract: anomaly detection in command shell sessions is a critical aspect of computer security. recent advances in deep learning and natural language processing, particularly transformer-based models, have shown great promise for addressing complex security challenges. in this paper, we implement a comprehensive approach to detect anomalies in unix shell sessions using a pretrained distilbert model, leveraging both unsupervised and supervised learning techniques to identify anomalous activity while minimizing data labeling. the unsupervised method captures the underlying structure and syntax of unix shell commands, enabling the detection of session deviations from normal behavior. experiments on a large-scale enterprise dataset collected from production systems demonstrate the effectiveness of our approach in detecting anomalous behavior in unix shell sessions. this work highlights the potential of leveraging recent advances in transformers to address important computer security challenges.

2023-10-18

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Heng Huang, Jiuxiang Gu, Tianyi Zhou
Abstract: recent advancements in large language models (llms) have expanded the horizons of natural language understanding and generation. notably, the output control and alignment with the input of llms can be refined through instruction tuning. however, as highlighted in several studies, low-quality data in the training set are usually detrimental to instruction tuning, resulting in inconsistent or even misleading llm outputs. we propose a novel method, termed "reflection-tuning," which addresses the problem by self-improvement and judging capabilities of llms. this approach utilizes an oracle llm to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data. extensive experiments on widely used evaluation benchmarks show that llms trained with our recycled data outperform those trained with existing datasets in various benchmarks.
Guande He, Peng Cui, Jianfei Chen, Wenbo Hu, Jun Zhu
Abstract: despite the significant progress made in practical applications of aligned language models (lms), they tend to be overconfident in output answers compared to the corresponding pre-trained lms. in this work, we systematically evaluate the impact of the alignment process on logit-based uncertainty calibration of lms under the multiple-choice setting. we first conduct a thoughtful empirical study on how aligned lms differ in calibration from their pre-trained counterparts. experimental results reveal that there are two distinct uncertainties in lms under the multiple-choice setting, which are responsible for the answer decision and the format preference of the lms, respectively. then, we investigate the role of these two uncertainties on aligned lm's calibration through fine-tuning in simple synthetic alignment schemes and conclude that one reason for aligned lms' overconfidence is the conflation of these two types of uncertainty. furthermore, we examine the utility of common post-hoc calibration methods for aligned lms and propose an easy-to-implement and sample-efficient method to calibrate aligned lms. we hope our findings could provide insights into the design of more reliable alignment processes for lms.
Yuval Pinter, Michael Elhadad
Abstract: we call into question the recently popularized method of direct model editing as a means of correcting factual errors in llm generations. we contrast model editing with three similar but distinct approaches that pursue better defined objectives: (1) retrieval-based architectures, which decouple factual memory from inference and linguistic capabilities embodied in llms; (2) concept erasure methods, which aim at preventing systemic bias in generated text; and (3) attribution methods, which aim at grounding generations into identified textual sources. we argue that direct model editing cannot be trusted as a systematic remedy for the disadvantages inherent to llms, and while it has proven potential in improving model explainability, it opens risks by reinforcing the notion that models can be trusted for factuality. we call for cautious promotion and application of model editing as part of the llm deployment process, and for responsibly limiting the use cases of llms to those not relying on editing as a critical component.
Laura Weidinger, Maribeth Rauh, Nahema Marchal, Arianna Manzini, Lisa Anne Hendricks, Juan Mateos-Garcia, Stevie Bergman, Jackie Kay, Conor Griffin, Ben Bariach, Iason Gabriel, Verena Rieser, William Isaac
Abstract: generative ai systems produce a range of risks. to ensure the safety of generative ai systems, these risks must be evaluated. in this paper, we make two main contributions toward establishing such evaluations. first, we propose a three-layered framework that takes a structured, sociotechnical approach to evaluating these risks. this framework encompasses capability evaluations, which are the main current approach to safety evaluation. it then reaches further by building on system safety principles, particularly the insight that context determines whether a given capability may cause harm. to account for relevant context, our framework adds human interaction and systemic impacts as additional layers of evaluation. second, we survey the current state of safety evaluation of generative ai systems and create a repository of existing evaluations. three salient evaluation gaps emerge from this analysis. we propose ways forward to closing these gaps, outlining practical steps as well as roles and responsibilities for different actors. sociotechnical safety evaluation is a tractable approach to the robust and comprehensive safety evaluation of generative ai systems.
Xiang Chen, Duanzheng Song, Honghao Gui, Chengxi Wang, Ningyu Zhang, Fei Huang, Chengfei Lv, Dan Zhang, Huajun Chen
Abstract: large language models (llms), such as chatgpt/gpt-4, have garnered widespread attention owing to their myriad of practical applications, yet their adoption has been constrained by issues of fact-conflicting hallucinations across web platforms. the assessment of factuality in text, produced by llms, remains inadequately explored, extending not only to the judgment of vanilla facts but also encompassing the evaluation of factual errors emerging in complex inferential tasks like multi-hop, and etc. in response, we introduce factchd, a fact-conflicting hallucination detection benchmark meticulously designed for llms. functioning as a pivotal tool in evaluating factuality within "query-respons" contexts, our benchmark assimilates a large-scale dataset, encapsulating a broad spectrum of factuality patterns, such as vanilla, multi-hops, comparison, and set-operation patterns. a distinctive feature of our benchmark is its incorporation of fact-based chains of evidence, thereby facilitating comprehensive and conducive factual reasoning throughout the assessment process. we evaluate multiple llms, demonstrating the effectiveness of the benchmark and current methods fall short of faithfully detecting factual errors. furthermore, we present truth-triangulator that synthesizes reflective considerations by tool-enhanced chatgpt and lora-tuning based on llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence. the benchmark dataset and source code will be made available in https://github.com/zjunlp/factchd.
Giuseppe Attanasio, Flor Miriam Plaza-Del-Arco, Debora Nozza, Anne Lauscher
Abstract: recent instruction fine-tuned models can solve multiple nlp tasks when prompted to do so, with machine translation (mt) being a prominent use case. however, current research often focuses on standard performance benchmarks, leaving compelling fairness and ethical considerations behind. in mt, this might lead to misgendered translations, resulting, among other harms, in the perpetuation of stereotypes and prejudices. in this work, we address this gap by investigating whether and to what extent such models exhibit gender bias in machine translation and how we can mitigate it. concretely, we compute established gender bias metrics on the winomt corpus from english to german and spanish. we discover that ift models default to male-inflected translations, even disregarding female occupational stereotypes. next, using interpretability methods, we unveil that models systematically overlook the pronoun indicating the gender of a target occupation in misgendered translations. finally, based on this finding, we propose an easy-to-implement and effective bias mitigation solution based on few-shot learning that leads to significantly fairer translations.
Meng Tong, Kejiang Chen, Yuang Qi, Jie Zhang, Weiming Zhang, Nenghai Yu
Abstract: large language models (llms), such as chatgpt, have simplified text generation tasks, yet their inherent privacy risks are increasingly garnering attention. existing solutions for privacy-preserving inference face significant challenges in practical deployment and implementation. in this paper, we propose privinfer, the first practical framework for privacy-preserving inference. it comprises two modules specifically designed for black-box llms in text generation. the perturbation module, employing differential privacy, generates perturbed prompts, thus enabling privacy-preserving inference with black-box llms. the restoration module extracts coherent and meaningful responses from obtained perturbed results, thus ensuring the accomplishment of the text generation tasks. additionally, to enhance privacy and utility further, we develop rantext, a novel differential privacy mechanism integrated into the perturbation module of privinfer. this mechanism is specifically tailored for llms and utilizes random adjacency in text perturbations. experimental results indicate that privinfer is comparable to gpt-4 in text generation quality, and rantext outperforms the current leading scheme in privacy protection, even under its adaptive attack, our proposed gpt inference attack.
Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, Farinaz Koushanfar
Abstract: we present remark-llm, a novel efficient, and robust watermarking framework designed for texts generated by large language models (llms). synthesizing human-like content using llms necessitates vast computational resources and extensive datasets, encapsulating critical intellectual property (ip). however, the generated content is prone to malicious exploitation, including spamming and plagiarism. to address the challenges, remark-llm proposes three new components: (i) a learning-based message encoding module to infuse binary signatures into llm-generated texts; (ii) a reparameterization module to transform the dense distributions from the message encoding to the sparse distribution of the watermarked textual tokens; (iii) a decoding module dedicated for signature extraction; furthermore, we introduce an optimized beam search algorithm to guarantee the coherence and consistency of the generated content. remark-llm is rigorously trained to encourage the preservation of semantic integrity in watermarked content, while ensuring effective watermark retrieval. extensive evaluations on multiple unseen datasets highlight remark-llm proficiency and transferability in inserting 2 times more signature bits into the same texts when compared to prior art, all while maintaining semantic integrity. furthermore, remark-llm exhibits better resilience against a spectrum of watermark detection and removal attacks.
Hongwei Yao, Jian Lou, Zhan Qin
Abstract: prompts have significantly improved the performance of pretrained large language models (llms) on various downstream tasks recently, making them increasingly indispensable for a diverse range of llm application scenarios. however, the backdoor vulnerability, a serious security threat that can maliciously alter the victim model's normal predictions, has not been sufficiently explored for prompt-based llms. in this paper, we present poisonprompt, a novel backdoor attack capable of successfully compromising both hard and soft prompt-based llms. we evaluate the effectiveness, fidelity, and robustness of poisonprompt through extensive experiments on three popular prompt methods, using six datasets and three widely used llms. our findings highlight the potential security threats posed by backdoor attacks on prompt-based llms and emphasize the need for further research in this area.
Xiang Shi, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu
Abstract: the advent of large language models (llms) has shown the potential to improve relevance and provide direct answers in web searches. however, challenges arise in validating the reliability of generated results and the credibility of contributing sources, due to the limitations of traditional information retrieval algorithms and the llm hallucination problem. aiming to create a "pagerank" for the llm era, we strive to transform llm into a relevant, responsible, and trustworthy searcher. we propose a novel generative retrieval framework leveraging the knowledge of llms to foster a direct link between queries and online sources. this framework consists of three core modules: generator, validator, and optimizer, each focusing on generating trustworthy online sources, verifying source reliability, and refining unreliable sources, respectively. extensive experiments and evaluations highlight our method's superior relevance, responsibility, and trustfulness against various sota methods.

2023-10-17

Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
Abstract: large language models (llms) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. despite extensive study on specific issues like bias, the intrinsic values of llms remain largely unexplored from a moral philosophy perspective. this work delves into ethical values utilizing moral foundation theory. moving beyond conventional discriminative evaluations with poor reliability, we propose denevil, a novel prompt generation algorithm tailored to dynamically exploit llms' value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. on such a basis, we construct moralprompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of llms. we discovered that most models are essentially misaligned, necessitating further ethical value alignment. in response, we develop vilmo, an in-context alignment method that substantially enhances the value compliance of llm outputs by learning to generate appropriate value instructions, outperforming existing competitors. our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of llms.
Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-Yi Lee
Abstract: recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (llms) such as chatgpt and gpt-4. these llm-based chatbots encode the potential biases while retaining disparities that can harm humans during interactions. the traditional biases investigation methods often rely on human-written test cases. however, these test cases are usually expensive and limited. in this work, we propose a first-of-its-kind method that automatically generates test cases to detect llms' potential gender bias. we apply our method to three well-known llms and find that the generated test cases effectively identify the presence of biases. to address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning to circumvent the need for parameter fine-tuning. the experimental results show that llms generate fairer responses with the proposed approach.
Enyu Zhou, Rui Zheng, Zhiheng Xi, Songyang Gao, Xiaoran Fan, Zichu Fei, Jingting Ye, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: reports of human-like behaviors in foundation models are growing, with psychological theories providing enduring tools to investigate these behaviors. however, current research tends to directly apply these human-oriented tools without verifying the faithfulness of their outcomes. in this paper, we introduce a framework, realbehavior, which is designed to characterize the humanoid behaviors of models faithfully. beyond simply measuring behaviors, our framework assesses the faithfulness of results based on reproducibility, internal and external consistency, and generalizability. our findings suggest that a simple application of psychological tools cannot faithfully characterize all human-like behaviors. moreover, we discuss the impacts of aligning models with human and social values, arguing for the necessity of diversifying alignment objectives to prevent the creation of models with restricted characteristics.
Linyang Li, Botian Jiang, Pengyu Wang, Ke Ren, Hang Yan, Xipeng Qiu
Abstract: abuse of large language models reveals high risks as large language models are being deployed at an astonishing speed. it is important to protect the model weights to avoid malicious usage that violates licenses of open-source large language models. this paper proposes a novel watermarking strategy that plants watermarks in the quantization process of large language models without pre-defined triggers during inference. the watermark works when the model is used in the fp32 mode and remains hidden when the model is quantized to int8, in this way, the users can only inference the model without further supervised fine-tuning of the model. we successfully plant the watermark into open-source large language model weights including gpt-neo and llama. we hope our proposed method can provide a potential direction for protecting model weights in the era of large language model applications.
Rui Wen, Tianhao Wang, Michael Backes, Yang Zhang, Ahmed Salem
Abstract: large language models (llms) are powerful tools for natural language processing, enabling novel applications and user experiences. however, to achieve optimal performance, llms often require adaptation with private data, which poses privacy and security challenges. several techniques have been proposed to adapt llms with private data, such as low-rank adaptation (lora), soft prompt tuning (spt), and in-context learning (icl), but their comparative privacy and security properties have not been systematically investigated. in this work, we fill this gap by evaluating the robustness of lora, spt, and icl against three types of well-established attacks: membership inference, which exposes data leakage (privacy); backdoor, which injects malicious behavior (security); and model stealing, which can violate intellectual property (privacy and security). our results show that there is no silver bullet for privacy and security in llm adaptation and each technique has different strengths and weaknesses.
Andreas Happe, Aaron Kaplan, Jürgen Cito
Abstract: penetration testing, an essential component of cybersecurity, allows organizations to proactively identify and remediate vulnerabilities in their systems, thus bolstering their defense mechanisms against potential cyberattacks. one recent advancement in the realm of penetration testing is the utilization of language models (llms). we explore the intersection of llms and penetration testing to gain insight into their capabilities and challenges in the context of privilige escalation. we create an automated linux privilege-escalation benchmark utilizing local virtual machines. we introduce an llm-guided privilege-escalation tool designed for evaluating different llms and prompt strategies against our benchmark. we analyze the impact of different prompt designs, the benefits of in-context learning, and the advantages of offering high-level guidance to llms. we discuss challenging areas for llms, including maintaining focus during testing, coping with errors, and finally comparing them with both stochastic parrots as well as with human hackers.
Siyan Zhao, John Dang, Aditya Grover
Abstract: many applications of large language models (llms), ranging from chatbots to creative writing, require nuanced subjective judgments that can differ significantly across different groups. existing alignment algorithms can be expensive to align for each group, requiring prohibitive amounts of group-specific preference data and computation for real-world use cases. we introduce group preference optimization (gpo), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. in gpo, we augment the base llm with an independent transformer module trained to predict the preferences of a group for the llm generations. for few-shot learning, we parameterize this module as an in-context autoregressive transformer and train it via meta-learning on several groups. we empirically validate the efficacy of gpo through rigorous evaluations using llms with varied sizes on three human opinion adaptation tasks. these tasks involve adapting to the preferences of us demographic groups, global countries, and individual users. our results demonstrate that gpo not only aligns models more accurately but also requires fewer group-specific preferences, and less training and inference computing resources, outperforming existing strategies such as in-context steering and fine-tuning methods.
Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu
Abstract: while reinforcement learning from human feedback (rlhf) aligns large language models (llms) with general, aggregate human preferences, it is suboptimal for learning diverse, individual perspectives. in this work, we study reinforcement learning from personalized human feedback (rlphf) problem, wherein llms are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a multi-objective reinforcement learning (morl) problem. compared to strong single-objective baselines, we show that we can achieve personalized alignment by decomposing preferences into multiple dimensions. these dimensions are defined based on personalizations that are declared as desirable by the user. in this work, we show that they can be efficiently trained independently in a distributed manner and combined effectively post-hoc through parameter merging. the code is available at https://github.com/joeljang/rlphf.
Belinda Z. Li, Alex Tamkin, Noah Goodman, Jacob Andreas
Abstract: language models (lms) can be directed to perform target tasks by using labeled examples or natural language prompts. but selecting examples or writing prompts for can be challenging--especially in tasks that involve unusual edge cases, demand precise articulation of nebulous preferences, or require an accurate mental model of lm behavior. we propose to use *lms themselves* to guide the task specification process. in this paper, we introduce **generative active task elicitation (gate)**: a learning framework in which models elicit and infer intended behavior through free-form, language-based interaction with users. we study gate in three domains: email validation, content recommendation, and moral reasoning. in preregistered experiments, we show that lms prompted to perform gate (e.g., by generating open-ended questions or synthesizing informative edge cases) elicit responses that are often more informative than user-written prompts or labels. users report that interactive task elicitation requires less effort than prompting or example labeling and surfaces novel considerations not initially anticipated by users. our findings suggest that lm-driven elicitation can be a powerful tool for aligning models to complex human preferences and values.

2023-10-16

Keita Saito, Akifumi Wachi, Koki Wataoka, Youhei Akimoto
Abstract: in recent years, large language models (llms) have witnessed a remarkable surge in prevalence, altering the landscape of natural language processing and machine learning. one key factor in improving the performance of llms is alignment with humans achieved with reinforcement learning from human feedback (rlhf), as for many llms such as gpt-4, bard, etc. in addition, recent studies are investigating the replacement of human feedback with feedback from other llms named reinforcement learning from ai feedback (rlaif). we examine the biases that come along with evaluating llms with other llms and take a closer look into verbosity bias -- a bias where llms sometimes prefer more verbose answers even if they have similar qualities. we see that in our problem setting, gpt-4 prefers longer answers more than humans. we also propose a metric to measure this bias.
Shuyu Jiang, Xingshu Chen, Rui Tang
Abstract: recently, large language models (llms) with powerful general capabilities have been increasingly integrated into various web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. unfortunately, they remain the risk of generating harmful content like hate speech and criminal activities in practical applications. current approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. however, they typically focused on the "superficial" harmful prompts with a solitary intent, ignoring composite attack instructions with multiple intentions that can easily elicit harmful content in real-world scenarios. in this paper, we introduce an innovative technique for obfuscating harmful instructions: compositional instruction attacks (cia), which refers to attacking by combination and encapsulation of multiple instructions. cia hides harmful prompts within instructions of harmless intentions, making it impossible for the model to identify underlying malicious intentions. furthermore, we implement two transformation methods, known as t-cia and w-cia, to automatically disguise harmful instructions as talking or writing tasks, making them appear harmless to llms. we evaluated cia on gpt-4, chatgpt, and chatglm2 with two safety assessment datasets and two harmful prompt datasets. it achieves an attack success rate of 95%+ on safety assessment datasets, and 83%+ for gpt-4, 91%+ for chatgpt (gpt-3.5-turbo backed) and chatglm2-6b on harmful prompt datasets. our approach reveals the vulnerability of llms to such compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to llm security development. warning: this paper may contain offensive or upsetting content!
Haoran Li, Yulin Chen, Jinglong Luo, Yan Kang, Xiaojin Zhang, Qi Hu, Chunkit Chan, Yangqiu Song
Abstract: the advancement of large language models (llms) has significantly enhanced the ability to effectively tackle various downstream nlp tasks and unify these tasks into generative pipelines. on the one hand, powerful language models, trained on massive textual data, have brought unparalleled accessibility and usability for both models and users. on the other hand, unrestricted access to these models can also introduce potential malicious and unintentional privacy risks. despite ongoing efforts to address the safety and privacy concerns associated with llms, the problem remains unresolved. in this paper, we provide a comprehensive analysis of the current privacy attacks targeting llms and categorize them according to the adversary's assumed capabilities to shed light on the potential vulnerabilities present in llms. then, we present a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks. beyond existing works, we identify upcoming privacy concerns as llms evolve. lastly, we point out several potential avenues for future exploration.
Kai Chen, Chunwei Wang, Kuo Yang, Jianhua Han, Lanqing Hong, Fei Mi, Hang Xu, Zhengying Liu, Wenyong Huang, Zhenguo Li, Dit-Yan Yeung, Lifeng Shang, Xin Jiang, Qun Liu
Abstract: the rapid advancement of large language models (llms) presents both opportunities and challenges, particularly concerning unintentional generation of harmful and toxic responses. while the traditional alignment methods strive to steer llms towards desired performance and shield them from malicious content, this study proposes a novel alignment strategy rooted in mistake analysis by exposing llms to flawed outputs purposefully and then conducting a thorough assessment to fully comprehend internal reasons via natural language analysis. thus, toxic responses can be transformed into instruction tuning corpus for model alignment, and llms can not only be deterred from generating flawed responses but also trained to self-criticize, leveraging its innate ability to discriminate toxic content. experimental results demonstrate that the proposed method outperforms conventional alignment techniques for safety instruction following, while maintaining superior efficiency.
Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, Jonathan Cohen
Abstract: nemo guardrails is an open-source toolkit for easily adding programmable guardrails to llm-based conversational systems. guardrails (or rails for short) are a specific way of controlling the output of an llm, such as not talking about topics considered harmful, following a predefined dialogue path, using a particular language style, and more. there are several mechanisms that allow llm providers and developers to add guardrails that are embedded into a specific model at training, e.g. using model alignment. differently, using a runtime inspired from dialogue management, nemo guardrails allows developers to add programmable rails to llm applications - these are user-defined, independent of the underlying llm, and interpretable. our initial results show that the proposed approach can be used with several llm providers to develop controllable and safe llm applications using programmable rails.
Sagi Shaier, Lawrence E. Hunter, Katharina Von Der Wense
Abstract: both standalone language models (lms) as well as lms within downstream-task systems have been shown to generate statements which are factually untrue. this problem is especially severe for low-resource languages, where training data is scarce and of worse quality than for high-resource languages. in this opinion piece, we argue that lms in their current state will never be fully trustworthy in critical settings and suggest a possible novel strategy to handle this issue: by building lms such that can cite their sources - i.e., point a user to the parts of their training data that back up their outputs. we first discuss which current nlp tasks would or would not benefit from such models. we then highlight the expected benefits such models would bring, e.g., quick verifiability of statements. we end by outlining the individual tasks that would need to be solved on the way to developing lms with the ability to cite. we hope to start a discussion about the field's current approach to building lms, especially for low-resource languages, and the role of the training data in explaining model generations.
Anirudh Som, Karan Sikka, Helen Gent, Ajay Divakaran, Andreas Kathol, Dimitra Vergyri
Abstract: paraphrasing of offensive content is a better alternative to content removal and helps improve civility in a communication environment. supervised paraphrasers; however, rely heavily on large quantities of labelled data to help preserve meaning and intent. they also retain a large portion of the offensiveness of the original content, which raises questions on their overall usability. in this paper we aim to assist practitioners in developing usable paraphrasers by exploring in-context learning (icl) with large language models (llms), i.e., using a limited number of input-label demonstration pairs to guide the model in generating desired outputs for specific queries. our study focuses on key factors such as -- number and order of demonstrations, exclusion of prompt instruction, and reduction in measured toxicity. we perform principled evaluation on three datasets, including our proposed context-aware polite paraphrase dataset, comprising of dialogue-style rude utterances, polite paraphrases, and additional dialogue context. we evaluate our approach using two closed source and one open source llm. our results reveal that icl is comparable to supervised methods in generation quality, while being qualitatively better by 25% on human evaluation and attaining lower toxicity by 76%. also, icl-based paraphrasers only show a slight reduction in performance even with just 10% training data.
Jiaying Wu, Bryan Hooi
Abstract: it is commonly perceived that online fake news and reliable news exhibit stark differences in writing styles, such as the use of sensationalist versus objective language. however, we emphasize that style-related features can also be exploited for style-based attacks. notably, the rise of powerful large language models (llms) has enabled malicious users to mimic the style of trustworthy news outlets at minimal cost. our analysis reveals that llm-camouflaged fake news content leads to substantial performance degradation of state-of-the-art text-based detectors (up to 38% decrease in f1 score), posing a significant challenge for automated detection in online ecosystems. to address this, we introduce sheepdog, a style-agnostic fake news detector robust to news writing styles. sheepdog achieves this adaptability through llm-empowered news reframing, which customizes each article to match different writing styles using style-oriented reframing prompts. by employing style-agnostic training, sheepdog enhances its resilience to stylistic variations by maximizing prediction consistency across these diverse reframings. furthermore, sheepdog extracts content-focused veracity attributions from llms, where the news content is evaluated against a set of fact-checking rationales. these attributions provide supplementary information and potential interpretability that assist veracity prediction. on three benchmark datasets, empirical results show that sheepdog consistently yields significant improvements over competitive baselines and enhances robustness against llm-empowered style attacks.
Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, Nael Abu-Ghazaleh
Abstract: large language models (llms) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. this paper surveys research in the emerging interdisciplinary field of adversarial attacks on llms, a subfield of trustworthy ml, combining the perspectives of natural language processing and security. prior work has shown that even safety-aligned llms (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead ai systems, as evidenced by the prevalence of `jailbreak' attacks on models like chatgpt and bard. in this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. we also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. to make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd annual meeting of the association for computational linguistics (acl'24).
Christina Chance, Da Yin, Dakuo Wang, Kai-Wei Chang
Abstract: recent studies show that traditional fairytales are rife with harmful gender biases. to help mitigate these gender biases in fairytales, this work aims to assess learned biases of language models by evaluating their robustness against gender perturbations. specifically, we focus on question answering (qa) tasks in fairytales. using counterfactual data augmentation to the fairytaleqa dataset, we evaluate model robustness against swapped gender character information, and then mitigate learned biases by introducing counterfactual gender stereotypes during training time. we additionally introduce a novel approach that utilizes the massive vocabulary of language models to support text genres beyond fairytales. our experimental results suggest that models are sensitive to gender perturbations, with significant performance drops compared to the original testing set. however, when first fine-tuned on a counterfactual training dataset, models are less sensitive to the later introduced anti-gender stereotyped text.
Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
Abstract: as language models (lms) become more capable, it is increasingly important to align them with human preferences. however, the dominant paradigm for training preference models (pms) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. we propose compositional preference models (cpms), a novel pm framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted lm, and aggregates these scores using a logistic regression classifier. cpms allow to control which properties of the preference data are used to train the preference model and to build it based on features that are believed to underlie the human preference judgment. our experiments show that cpms not only improve generalization and are more robust to overoptimization than standard pms, but also that best-of-n samples obtained using cpms tend to be preferred over samples obtained using conventional pms. overall, our approach demonstrates the benefits of endowing pms with priors about which features determine human preferences while relying on lm capabilities to extract those features in a scalable and robust way.

2023-10-15

Weixuan Wang, Barry Haddow, Alexandra Birch, Wei Peng
Abstract: large language models (llms) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. llms are typically evaluated using accuracy, yet this metric does not capture the vulnerability of llms to hallucination-inducing factors like prompt and context variability. how do we evaluate the capabilities of llms to consistently produce factually correct answers? in this paper, we propose model knowledge reliability score (monitor), a novel metric designed to directly measure llms' factual reliability. monitor computes the distance between the probability distributions of a valid output and its counterparts produced by the same llm probing the same fact using different styles of prompts and contexts.experiments on a comprehensive range of 12 llms demonstrate the effectiveness of monitor in evaluating the factual reliability of llms while maintaining a low computational overhead. in addition, we release the fktc (factual knowledge test corpus) test set, containing 210,158 prompts in total to foster research along this line (https://github.com/vicky-wil/monitor).
Marc Schmitt, Ivan Flechais
Abstract: the advancement of artificial intelligence (ai) and machine learning (ml) has profound implications for both the utility and security of our digital interactions. this paper investigates the transformative role of generative ai in social engineering (se) attacks. we conduct a systematic review of social engineering and ai capabilities and use a theory of social engineering to identify three pillars where generative ai amplifies the impact of se attacks: realistic content creation, advanced targeting and personalization, and automated attack infrastructure. we integrate these elements into a conceptual model designed to investigate the complex nature of ai-driven se attacks - the generative ai social engineering framework. we further explore human implications and potential countermeasures to mitigate these risks. our study aims to foster a deeper understanding of the risks, human implications, and countermeasures associated with this emerging paradigm, thereby contributing to a more secure and trustworthy human-computer interaction.

2023-10-14

Haikang Deng, Colin Raffel
Abstract: while large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. in this paper, we introduce reward-augmented decoding (rad), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. specifically, rad uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. by using a unidirectional reward model, rad can cache activations from prior generation steps to decrease computational overhead. through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that rad performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. we further validate that rad is effective on very large language models while incurring a minimal computational overhead.
Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, Wenjie Li
Abstract: language model detoxification aims to minimize the risk of generating offensive or harmful content in pretrained language models (plms) for safer deployment. existing methods can be roughly categorized as finetuning-based and decoding-based. however, the former is often resource-intensive, while the latter relies on additional components and potentially compromises the generation fluency. in this paper, we propose a more lightweight approach that enables the plm itself to achieve "self-detoxification". our method is built upon the observation that prepending a negative steering prompt can effectively induce plms to generate toxic content. at the same time, we are inspired by the recent research in the interpretability field, which formulates the evolving contextualized representations within the plm as an information stream facilitated by the attention layers. drawing on this idea, we devise a method to identify the toxification direction from the normal generation process to the one prompted with the negative prefix, and then steer the generation to the reversed direction by manipulating the information movement within the attention layers. experimental results show that our approach, without any fine-tuning or extra components, can achieve comparable performance with state-of-the-art methods.
Alex Mei, Sharon Levy, William Yang Wang
Abstract: as large language models are integrated into society, robustness toward a suite of prompts is increasingly important to maintain reliability in a high-variance environment.robustness evaluations must comprehensively encapsulate the various settings in which a user may invoke an intelligent system. this paper proposes assert, automated safety scenario red teaming, consisting of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. for robust safety evaluation, we apply these methods in the critical domain of ai safety to algorithmically generate a test suite of prompts covering diverse robustness settings -- semantic equivalence, related scenarios, and adversarial. we partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. despite dedicated safeguards in existing state-of-the-art models, we find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings, raising concerns for users' physical safety.

2023-10-13

Sehyun Choi, Tianqing Fang, Zhaowei Wang, Yangqiu Song
Abstract: large language models (llms) have demonstrated remarkable human-level natural language generation capabilities. however, their potential to generate misinformation, often called the hallucination problem, poses a significant risk to their deployment. a common approach to address this issue is to retrieve relevant knowledge and fine-tune the llm with the knowledge in its input. unfortunately, this method incurs high training costs and may cause catastrophic forgetting for multi-tasking models. to overcome these limitations, we propose a knowledge-constrained decoding method called kcts (knowledge-constrained tree search), which guides a frozen lm to generate text aligned with the reference knowledge at each decoding step using a knowledge classifier score and mcts (monte-carlo tree search). to adapt the sequence-level knowledge classifier to token-level guidance, we also propose a novel token-level hallucination detection method called ripa (reward inflection point approximation). our empirical results on knowledge-grounded dialogue and abstractive summarization demonstrate the strength of kcts as a plug-and-play, model-agnostic decoding method that can effectively reduce hallucinations in natural language generation.
Peihua Mai, Ran Yan, Zhe Huang, Youjia Yang, Yan Pang
Abstract: large language models (llms) shows powerful capability in natural language understanding by capturing hidden semantics in vector space. this process enriches the value of the text embeddings for various downstream tasks, thereby fostering the embedding-as-a-service (eaas) business model. however, the direct transmission of text to servers poses a largely unaddressed risk of privacy leakage. to mitigate this issue, we introduce split-n-denoise (snd), an innovative framework that split the model to execute the token embedding layer on the client side at minimal computational cost. this allows the client to introduce noise prior to transmitting the embeddings to the server, and subsequently receive and denoise the perturbed output embeddings for downstream tasks. our approach is designed for the inference stage of llms and requires no modifications to the model parameters. extensive experiments demonstrate snd's effectiveness in optimizing the privacy-utility tradeoff across various llm architectures and diverse downstream tasks. the results reveal a significant performance improvement under the same privacy budget compared to the baseline, offering clients a privacy-preserving solution for local privacy protection.
Jason Hausenloy, Andrea Miotti, Claire Dennis
Abstract: this paper proposes a multinational artificial general intelligence consortium (magic) to mitigate existential risks from advanced artificial intelligence (ai). magic would be the only institution in the world permitted to develop advanced ai, enforced through a global moratorium by its signatory members on all other advanced ai development. magic would be exclusive, safety-focused, highly secure, and collectively supported by member states, with benefits distributed equitably among signatories. magic would allow narrow ai models to flourish while significantly reducing the possibility of misaligned, rogue, breakout, or runaway outcomes of general-purpose systems. we do not address the political feasibility of implementing a moratorium or address the specific legislative strategies and rules needed to enforce a ban on high-capacity agi training runs. instead, we propose one positive vision of the future, where magic, as a global governance regime, can lay the groundwork for long-term, safe regulation of advanced ai.
Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, Nanyun Peng
Abstract: as generative language models advance, users have started to utilize large language models (llms) to assist in writing various types of content, including professional documents such as recommendation letters. despite their convenience, these applications introduce unprecedented fairness concerns. as generated reference letters might be directly utilized by users in professional or academic scenarios, they have the potential to cause direct social harms, such as lowering success rates for female applicants. therefore, it is imminent and necessary to comprehensively study fairness issues and associated harms in such real-world use cases for future mitigation and monitoring. in this paper, we critically examine gender bias in llm-generated reference letters. inspired by findings in social science, we design evaluation methods to manifest gender biases in llm-generated letters through 2 dimensions: biases in language style and biases in lexical content. furthermore, we investigate the extent of bias propagation by separately analyze bias amplification in model-hallucinated contents, which we define to be the hallucination bias of model-generated documents. through benchmarking evaluation on 4 popular llms, including chatgpt, alpaca, vicuna and stablelm, our study reveals significant gender biases in llm-generated recommendation letters. our findings further point towards the importance and imminence to recognize biases in llm-generated professional documents.
Eun Cheol Choi, Emilio Ferrara
Abstract: in today's digital era, the rapid spread of misinformation poses threats to public well-being and societal trust. as online misinformation proliferates, manual verification by fact checkers becomes increasingly challenging. we introduce fact-gpt (fact-checking augmentation with claim matching task-oriented generative pre-trained transformer), a framework designed to automate the claim matching phase of fact-checking using large language models (llms). this framework identifies new social media content that either supports or contradicts claims previously debunked by fact-checkers. our approach employs gpt-4 to generate a labeled dataset consisting of simulated social media posts. this data set serves as a training ground for fine-tuning more specialized llms. we evaluated fact-gpt on an extensive dataset of social media content related to public health. the results indicate that our fine-tuned llms rival the performance of larger pre-trained llms in claim matching tasks, aligning closely with human annotations. this study achieves three key milestones: it provides an automated framework for enhanced fact-checking; demonstrates the potential of llms to complement human expertise; offers public resources, including datasets and models, to further research and applications in the fact-checking domain.
Cecilia Delgado Solorzano, Carlos Toxtli Hernandez
Abstract: large language models (llms), like chatgpt, are fundamentally tools trained on vast data, reflecting diverse societal impressions. this paper aims to investigate llms' self-perceived bias concerning indigeneity when simulating scenarios of indigenous people performing various roles. through generating and analyzing multiple scenarios, this work offers a unique perspective on how technology perceives and potentially amplifies societal biases related to indigeneity in social computing. the findings offer insights into the broader implications of indigeneity in critical computing.
Nikhil Kandpal, Krishna Pillutla, Alina Oprea, Peter Kairouz, Christopher A. Choquette-Choo, Zheng Xu
Abstract: fine-tuning is a common and effective method for tailoring large language models (llms) to specialized tasks and applications. in this paper, we study the privacy implications of fine-tuning llms on user data. to this end, we define a realistic threat model, called user inference, wherein an attacker infers whether or not a user's data was used for fine-tuning. we implement attacks for this threat model that require only a small set of samples from a user (possibly different from the samples used for training) and black-box access to the fine-tuned llm. we find that llms are susceptible to user inference attacks across a variety of fine-tuning datasets, at times with near perfect attack success rates. further, we investigate which properties make users vulnerable to user inference, finding that outlier users (i.e. those with data distributions sufficiently different from other users) and users who contribute large quantities of data are most susceptible to attack. finally, we explore several heuristics for mitigating privacy attacks. we find that interventions in the training algorithm, such as batch or per-example gradient clipping and early stopping fail to prevent user inference. however, limiting the number of fine-tuning samples from a single user can reduce attack effectiveness, albeit at the cost of reducing the total amount of fine-tuning data.
Yuanshun Yao, Xiaojun Xu, Yang Liu
Abstract: we study how to perform unlearning, i.e. forgetting undesirable (mis)behaviors, on large language models (llms). we show at least three scenarios of aligning llms with human preferences can benefit from unlearning: (1) removing harmful responses, (2) erasing copyright-protected content as requested, and (3) eliminating hallucinations. unlearning, as an alignment technique, has three advantages. (1) it only requires negative (e.g. harmful) examples, which are much easier and cheaper to collect (e.g. via red teaming or user reporting) than positive (e.g. helpful and often human-written) examples required in rlhf (rl from human feedback). (2) it is computationally efficient. (3) it is especially effective when we know which training samples cause the misbehavior. to the best of our knowledge, our work is among the first to explore llm unlearning. we are also among the first to formulate the settings, goals, and evaluations in llm unlearning. we show that if practitioners only have limited resources, and therefore the priority is to stop generating undesirable outputs rather than to try to generate desirable outputs, unlearning is particularly appealing. despite only having negative samples, our ablation study shows that unlearning can still achieve better alignment performance than rlhf with just 2% of its computational time.

2023-10-12

Luke Marks, Amir Abdullah, Luna Mendez, Rauno Arike, Philip Torr, Fazl Barez
Abstract: large language models (llms) aligned to human preferences via reinforcement learning from human feedback (rlhf) underpin many commercial applications. however, how rlhf impacts llm internals remains opaque. we propose a novel method to interpret learned reward functions in rlhf-tuned llms using sparse autoencoders. our approach trains autoencoder sets on activations from a base llm and its rlhf-tuned version. by comparing autoencoder hidden spaces, we identify unique features that reflect the accuracy of the learned reward model. to quantify this, we construct a scenario where the tuned llm learns token-reward mappings to maximize reward. this is the first application of sparse autoencoders for interpreting learned rewards and broadly inspecting reward learning in llms. our method provides an abstract approximation of reward integrity. this presents a promising technique for ensuring alignment between specified objectives and model behaviors.
Dominik Hintersdorf, Lukas Struppek, Daniel Neider, Kristian Kersting
Abstract: the proliferation of large ai models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. one of the concerns is that adversaries can extract information about the training data using privacy attacks. unfortunately, the task of removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging. we propose a rather easy yet effective defense based on backdoor attacks to remove private information such as names of individuals from models, and focus in this work on text encoders. specifically, through strategic insertion of backdoors, we align the embeddings of sensitive phrases with those of neutral terms-"a person" instead of the person's name. our empirical results demonstrate the effectiveness of our backdoor-based defense on clip by assessing its performance using a specialized privacy attack for zero-shot classifiers. our approach provides not only a new "dual-use" perspective on backdoor attacks, but also presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.
Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong
Abstract: there is growing interest in ensuring that large language models (llms) align with human values. however, the alignment of such models is vulnerable to adversarial jailbreaks, which coax llms into overriding their safety guardrails. the identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. to this end, we propose prompt automatic iterative refinement (pair), an algorithm that generates semantic jailbreaks with only black-box access to an llm. pair -- which is inspired by social engineering attacks -- uses an attacker llm to automatically generate jailbreaks for a separate targeted llm without human intervention. in this way, the attacker llm iteratively queries the target llm to update and refine a candidate jailbreak. empirically, pair often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. pair also achieves competitive jailbreaking success rates and transferability on open and closed-source llms, including gpt-3.5/4, vicuna, and palm-2.
Abel Salinas, Louis Penafiel, Robert Mccormack, Fred Morstatter
Abstract: large language models (llms) have garnered significant attention for their remarkable performance in a continuously expanding set of natural language processing tasks. however, these models have been shown to harbor inherent societal biases, or stereotypes, which can adversely affect their performance in their many downstream applications. in this paper, we introduce a novel, purely prompt-based approach to uncover hidden stereotypes within any arbitrary llm. our approach dynamically generates a knowledge representation of internal stereotypes, enabling the identification of biases encoded within the llm's internal knowledge. by illuminating the biases present in llms and offering a systematic methodology for their analysis, our work contributes to advancing transparency and promoting fairness in natural language processing systems.

2023-10-11

Abhinav Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, Monojit Choudhury
Abstract: in this position paper, we argue that instead of morally aligning llms to specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale. when provided with an ethical policy, an llm should be capable of making decisions that are ethically consistent to the policy. we develop a framework that integrates moral dilemmas with moral principles pertaining to different foramlisms of normative ethics, and at different levels of abstractions. initial experiments with gpt-x models shows that while gpt-4 is a nearly perfect ethical reasoner, the models still have bias towards the moral values of western and english speaking societies.
Robin Staab, Mark Vero, Mislav Balunović, Martin Vechev
Abstract: current privacy research on large language models (llms) primarily focuses on the issue of extracting memorized training data. at the same time, models' inference capabilities have increased drastically. this raises the key question of whether current llms could violate individuals' privacy by inferring personal attributes from text given at inference time. in this work, we present the first comprehensive study on the capabilities of pretrained llms to infer personal attributes from text. we construct a dataset consisting of real reddit profiles, and show that current llms can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to $85\%$ top-1 and $95.8\%$ top-3 accuracy at a fraction of the cost ($100\times$) and time ($240\times$) required by humans. as people increasingly interact with llm-powered chatbots across all aspects of life, we also explore the emerging threat of privacy-invasive chatbots trying to extract personal information through seemingly benign questions. finally, we show that common mitigations, i.e., text anonymization and model alignment, are currently ineffective at protecting user privacy against llm inference. our findings highlight that current llms can infer personal data at a previously unattainable scale. in the absence of working defenses, we advocate for a broader discussion around llm privacy implications beyond memorization, striving for a wider privacy protection.
Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker
Abstract: considerable effort has been dedicated to mitigating toxicity, but existing methods often require drastic modifications to model parameters or the use of computationally intensive auxiliary models. furthermore, previous approaches have often neglected the crucial factor of language's evolving nature over time. in this work, we present a comprehensive perspective on toxicity mitigation that takes into account its changing nature. we introduce goodtriever, a flexible methodology that matches the current state-of-the-art toxicity mitigation while achieving 43% relative latency reduction during inference and being more computationally efficient. by incorporating a retrieval-based approach at decoding time, goodtriever enables toxicity-controlled text generation. our research advocates for an increased focus on adaptable mitigation techniques, which better reflect the data drift models face when deployed in the wild. code and data are available at https://github.com/for-ai/goodtriever.
Hannah Rose Kirk, Andrew M. Bean, Bertie Vidgen, Paul Röttger, Scott A. Hale
Abstract: human feedback is increasingly used to steer the behaviours of large language models (llms). however, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. in this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the acl and arxiv repositories.first, we summarise the past, pre-llm trends for integrating human feedback into language models. second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. finally, we encourage a better future of feedback learning in llms by raising five unresolved conceptual and practical challenges.
Hai Huang, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
Abstract: large language models (llms) have demonstrated superior performance compared to previous methods on various tasks, and often serve as the foundation models for many researches and services. however, the untrustworthy third-party llms may covertly introduce vulnerabilities for downstream tasks. in this paper, we explore the vulnerability of llms through the lens of backdoor attacks. different from existing backdoor attacks against llms, ours scatters multiple trigger keys in different prompt components. such a composite backdoor attack (cba) is shown to be stealthier than implanting the same multiple trigger keys in only a single component. cba ensures that the backdoor is activated only when all trigger keys appear. our experiments demonstrate that cba is effective in both natural language processing (nlp) and multimodal tasks. for instance, with $3\%$ poisoning samples against the llama-7b model on the emotion dataset, our attack achieves a $100\%$ attack success rate (asr) with a false triggered rate (ftr) below $2.06\%$ and negligible model accuracy degradation. the unique characteristics of our cba can be tailored for various practical scenarios, e.g., targeting specific user groups. our work highlights the necessity of increased security research on the trustworthiness of foundation llms.
Yihan Wu, Zhengmian Hu, Hongyang Zhang, Heng Huang
Abstract: watermarking techniques offer a promising way to secure data via embedding covert information into the data. a paramount challenge in the domain lies in preserving the distribution of original data during watermarking. our research extends and refines existing watermarking framework, placing emphasis on the importance of a distribution-preserving (dip) watermark. contrary to the current strategies, our proposed dipmark preserves the original token distribution during watermarking (stealthy), is detectable without access to the language model api or weights (efficient), and is robust to moderate changes of tokens (resilient). this is achieved by incorporating a novel reweight strategy, combined with a hash function that assigns unique \textit{i.i.d.} ciphers based on the context. the empirical benchmarks of our approach underscore its stealthiness, efficiency, and resilience, making it a robust solution for watermarking tasks that demand impeccable quality preservation.

2023-10-10

Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen
Abstract: watermark algorithms for large language models (llms) have achieved extremely high accuracy in detecting text generated by llms. such algorithms typically involve adding extra watermark logits to the llm's logits at each generation step. however, prior algorithms face a trade-off between attack robustness and security robustness. this is because the watermark logits for a token are determined by a certain number of preceding tokens; a small number leads to low security robustness, while a large number results in insufficient attack robustness. in this work, we propose a semantic invariant watermarking method for llms that provides both attack robustness and security robustness. the watermark logits in our work are determined by the semantics of all preceding tokens. specifically, we utilize another embedding llm to generate semantic embeddings for all preceding tokens, and then these semantic embeddings are transformed into the watermark logits through our trained watermark model. subsequent analyses and experiments demonstrated the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing settings. finally, we also show that our watermark possesses adequate security robustness. our code and data are available at https://github.com/thu-bpm/robust_watermark.
Zeming Wei, Yifei Wang, Yisen Wang
Abstract: large language models (llms) have shown remarkable success in various tasks, but concerns about their safety and the potential for generating malicious content have emerged. in this paper, we explore the power of in-context learning (icl) in manipulating the alignment ability of llms. we find that by providing just few in-context demonstrations without fine-tuning, llms can be manipulated to increase or decrease the probability of jailbreaking, i.e. answering malicious prompts. based on these observations, we propose in-context attack (ica) and in-context defense (icd) methods for jailbreaking and guarding aligned language model purposes. ica crafts malicious contexts to guide models in generating harmful outputs, while icd enhances model robustness by demonstrations of rejecting to answer harmful prompts. our experiments show the effectiveness of ica and icd in increasing or reducing the success rate of adversarial jailbreaking attacks. overall, we shed light on the potential of icl to influence llm behavior and provide a new perspective for enhancing the safety and alignment of llms.
Kilian Sprenkamp, Daniel Gordon Jones, Liudmila Zavolokina
Abstract: the prevalence of propaganda in our digital society poses a challenge to societal harmony and the dissemination of truth. detecting propaganda through nlp in text is challenging due to subtle manipulation techniques and contextual dependencies. to address this issue, we investigate the effectiveness of modern large language models (llms) such as gpt-3 and gpt-4 for propaganda detection. we conduct experiments using the semeval-2020 task 11 dataset, which features news articles labeled with 14 propaganda techniques as a multi-label classification problem. five variations of gpt-3 and gpt-4 are employed, incorporating various prompt engineering and fine-tuning strategies across the different models. we evaluate the models' performance by assessing metrics such as $f1$ score, $precision$, and $recall$, comparing the results with the current state-of-the-art approach using roberta. our findings demonstrate that gpt-4 achieves comparable results to the current state-of-the-art. further, this study analyzes the potential and challenges of llms in complex tasks like propaganda detection.
Tianshu Yu, Ting-En Lin, Yuchuan Wu, Min Yang, Fei Huang, Yongbin Li
Abstract: in recent research on large language models (llms), there has been a growing emphasis on aligning these models with human values to reduce the impact of harmful content. however, current alignment methods often rely solely on singular forms of human feedback, such as preferences, annotated labels, or natural language critiques, overlooking the potential advantages of combining these feedback types. this limitation leads to suboptimal performance, even when ample training data is available. in this paper, we introduce constructive and diverse feedback (cdf) as a novel method to enhance llm alignment, inspired by constructivist learning theory. our approach involves collecting three distinct types of feedback tailored to problems of varying difficulty levels within the training dataset. specifically, we exploit critique feedback for easy problems, refinement feedback for medium problems, and preference feedback for hard problems. by training our model with this diversified feedback, we achieve enhanced alignment performance while using less training data. to assess the effectiveness of cdf, we evaluate it against previous methods in three downstream tasks: question answering, dialog generation, and text summarization. experimental results demonstrate that cdf achieves superior performance even with a smaller training dataset.
Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
Abstract: while large language models (llms) exhibit remarkable capabilities across a wide range of tasks, they pose potential safety concerns, such as the ``jailbreak'' problem, wherein malicious instructions can manipulate llms to exhibit undesirable behavior. although several preventive measures have been developed to mitigate the potential risks associated with llms, they have primarily focused on english data. in this study, we reveal the presence of multilingual jailbreak challenges within llms and consider two potential risk scenarios: unintentional and intentional. the unintentional scenario involves users querying llms using non-english prompts and inadvertently bypassing the safety mechanisms, while the intentional scenario concerns malicious users combining malicious instructions with multilingual prompts to deliberately attack llms. the experimental results reveal that in the unintentional scenario, the rate of unsafe content increases as the availability of languages decreases. specifically, low-resource languages exhibit three times the likelihood of encountering harmful content compared to high-resource languages, with both chatgpt and gpt-4. in the intentional scenario, multilingual prompts can exacerbate the negative impact of malicious instructions, with astonishingly high rates of unsafe output: 80.92\% for chatgpt and 40.71\% for gpt-4. to handle such a challenge in the multilingual context, we propose a novel \textsc{self-defense} framework that automatically generates multilingual training data for safety fine-tuning. experimental results show that chatgpt fine-tuned with such data can achieve a substantial reduction in unsafe content generation. data is available at https://github.com/damo-nlp-sg/multilingual-safety-for-llms. warning: this paper contains examples with potentially harmful content.
Shiping Yang, Renliang Sun, Xiaojun Wan
Abstract: large language models (llms) have shown their ability to collaborate effectively with humans in real-world scenarios. however, llms are apt to generate hallucinations, i.e., makeup incorrect text and unverified information, which can cause significant damage when deployed for mission-critical tasks. in this paper, we propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion. to facilitate future studies and assess different methods, we construct a hallucination detection benchmark named phd, which is generated by chatgpt and annotated by human annotators. contrasting previous studies of zero-resource hallucination detection, our method and benchmark concentrate on passage-level detection instead of sentence-level. we empirically evaluate our method and existing zero-resource detection methods on two datasets. the experimental results demonstrate that the proposed method considerably outperforms the baselines while costing fewer tokens and less time. furthermore, we manually analyze some hallucination cases that llm failed to capture, revealing the shared limitation of zero-resource methods.
Stephen Moskal, Sam Laney, Erik Hemberg, "Una-May O'Reilly"
Abstract: in this paper, we explore the potential of large language models (llms) to reason about threats, generate information about tools, and automate cyber campaigns. we begin with a manual exploration of llms in supporting specific threat-related actions and decisions. we proceed by automating the decision process in a cyber campaign. we present prompt engineering approaches for a plan-act-report loop for one action of a threat campaign and and a prompt chaining design that directs the sequential decision process of a multi-action campaign. we assess the extent of llm's cyber-specific knowledge w.r.t the short campaign we demonstrate and provide insights into prompt design for eliciting actionable responses. we discuss the potential impact of llms on the threat landscape and the ethical considerations of using llms for accelerating threat actor capabilities. we report a promising, yet concerning, application of generative ai to cyber threats. however, the llm's capabilities to deal with more complex networks, sophisticated vulnerabilities, and the sensitivity of prompts are open questions. this research should spur deliberations over the inevitable advancements in llm-supported cyber adversarial landscape.
Courtland Leer, Vincent Trost, Vineeth Voruganti
Abstract: recent research shows that large language models (llms) exhibit a compelling level of proficiency in theory of mind (tom) tasks. this ability to impute unobservable mental states to others is vital to human social cognition and may prove equally important in principal-agent relations between individual humans and artificial intelligences (ais). in this paper, we explore how a mechanism studied in developmental psychology known as violation of expectation (voe) can be implemented to reduce errors in llm prediction about users by leveraging emergent tom affordances. and we introduce a \textit{metacognitive prompting} framework to apply voe in the context of an ai tutor. by storing and retrieving facts derived in cases where llm expectation about the user was violated, we find that llms are able to learn about users in ways that echo theories of human learning. finally, we discuss latent hazards and augmentative opportunities associated with modeling user psychology and propose ways to mitigate risk along with possible directions for future inquiry.
Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, Danqi Chen
Abstract: the rapid progress in open-source large language models (llms) is significantly advancing ai development. extensive efforts have been made before model release to align their behavior with human values, with the primary goal of ensuring their helpfulness and harmlessness. however, even carefully aligned models can be manipulated maliciously, leading to unintended behaviors, known as "jailbreaks". these jailbreaks are typically triggered by specific text inputs, often referred to as adversarial prompts. in this work, we propose the generation exploitation attack, an extremely simple approach that disrupts model alignment by only manipulating variations of decoding methods. by exploiting different generation strategies, including varying decoding hyper-parameters and sampling methods, we increase the misalignment rate from 0% to more than 95% across 11 language models including llama2, vicuna, falcon, and mpt families, outperforming state-of-the-art attacks with $30\times$ lower computational cost. finally, we propose an effective alignment method that explores diverse generation strategies, which can reasonably reduce the misalignment rate under our attack. altogether, our study underscores a major failure in current safety evaluation and alignment procedures for open-source llms, strongly advocating for more comprehensive red teaming and better alignment before releasing such models. our code is available at https://github.com/princeton-sysml/jailbreak_llm.
Benjamin Kereopa-Yorke
Abstract: in a digital epoch where cyberspace is the emerging nexus of geopolitical contention, the melding of information operations and large language models (llms) heralds a paradigm shift, replete with immense opportunities and intricate challenges. as tools like the mistral 7b llm (mistral, 2023) democratise access to llm capabilities (jin et al., 2023), a vast spectrum of actors, from sovereign nations to rogue entities (howard et al., 2023), find themselves equipped with potent narrative-shaping instruments (goldstein et al., 2023). this paper puts forth a framework for navigating this brave new world in the "clausewitzgpt" equation. this novel formulation not only seeks to quantify the risks inherent in machine-speed llm-augmented operations but also underscores the vital role of autonomous ai agents (wang, xie, et al., 2023). these agents, embodying ethical considerations (hendrycks et al., 2021), emerge as indispensable components (wang, ma, et al., 2023), ensuring that as we race forward, we do not lose sight of moral compasses and societal imperatives. mathematically underpinned and inspired by the timeless tenets of clausewitz's military strategy (clausewitz, 1832), this thesis delves into the intricate dynamics of ai-augmented information operations. with references to recent findings and research (department of state, 2023), it highlights the staggering year-on-year growth of ai information campaigns (evgeny pashentsev, 2023), stressing the urgency of our current juncture. the synthesis of enlightenment thinking, and clausewitz's principles provides a foundational lens, emphasising the imperative of clear strategic vision, ethical considerations, and holistic understanding in the face of rapid technological advancement.
Apoorva Nitsure, Youssef Mroueh, Mattia Rigotti, Kristjan Greenewald, Brian Belgodere, Mikhail Yurochkin, Jiri Navratil, Igor Melnyk, Jerret Ross
Abstract: we propose a distributional framework for assessing socio-technical risks of foundation models with quantified statistical significance. our approach hinges on a new statistical relative testing based on first and second order stochastic dominance of real random variables. we show that the second order statistics in this test are linked to mean-risk models commonly used in econometrics and mathematical finance to balance risk and utility when choosing between alternatives. using this framework, we formally develop a risk-aware approach for foundation model selection given guardrails quantified by specified metrics. inspired by portfolio optimization and selection theory in mathematical finance, we define a \emph{metrics portfolio} for each model as a means to aggregate a collection of metrics, and perform model selection based on the stochastic dominance of these portfolios. the statistical significance of our tests is backed theoretically by an asymptotic analysis via central limit theorems instantiated in practice via a bootstrap variance estimate. we use our framework to compare various large language models regarding risks related to drifting from instructions and outputting toxic content.

2023-10-09

Robert Litschko, Max Müller-Eberstein, Rob Van Der Goot, Leon Weber, Barbara Plank
Abstract: language understanding is a multi-faceted cognitive capability, which the natural language processing (nlp) community has striven to model computationally for decades. traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. with the advent of large language models (llms) the community has witnessed a dramatic shift towards general purpose, task-agnostic approaches powered by generative models. as a consequence, the traditional compartmentalized notion of language tasks is breaking down, followed by an increasing challenge for evaluation and analysis. at the same time, llms are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in nlp, and pursue a more holistic view on language, placing trustworthiness at the center. towards this goal, we review existing compartmentalized approaches for understanding the origins of a model's functional capacity, and provide recommendations for more multi-faceted evaluation protocols.
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu
Abstract: the rapid development of large language models (llms) has substantially expanded the range of tasks they can address. in the field of natural language processing (nlp), researchers have shifted their focus from conventional nlp tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). this shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). in this paper, we propose a generative judge with 13b parameters, auto-j, designed to address these challenges. our model is trained on user queries and llm-generated responses under massive real-world scenarios and accommodates diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. to demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. experimentally, auto-j outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. we also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://github.com/gair-nlp/auto-j.
Yuwei Wang, Enmeng Lu, Zizhe Ruan, Yao Liang, Yi Zeng
Abstract: this paper presents social data and knowledge collective intelligence platform for training ethical ai models (stream) to address the challenge of aligning ai models with human moral values, and to provide ethics datasets and knowledge bases to help promote ai models "follow good advice as naturally as a stream follows its course". by creating a comprehensive and representative platform that accurately mirrors the moral judgments of diverse groups including humans and ais, we hope to effectively portray cultural and group variations, and capture the dynamic evolution of moral judgments over time, which in turn will facilitate the establishment, evaluation, embedding, embodiment, ensemble, and evolvement (6es) of the moral capabilities of ai models. currently, stream has already furnished a comprehensive collection of ethical scenarios, and amassed substantial moral judgment data annotated by volunteers and various popular large language models (llms), collectively portraying the moral preferences and performances of both humans and ais across a range of moral contexts. this paper will outline the current structure and construction of stream, explore its potential applications, and discuss its future prospects.
Polra Victor Falade
Abstract: in the ever-evolving realm of cybersecurity, the rise of generative ai models like chatgpt, fraudgpt, and wormgpt has introduced both innovative solutions and unprecedented challenges. this research delves into the multifaceted applications of generative ai in social engineering attacks, offering insights into the evolving threat landscape using the blog mining technique. generative ai models have revolutionized the field of cyberattacks, empowering malicious actors to craft convincing and personalized phishing lures, manipulate public opinion through deepfakes, and exploit human cognitive biases. these models, chatgpt, fraudgpt, and wormgpt, have augmented existing threats and ushered in new dimensions of risk. from phishing campaigns that mimic trusted organizations to deepfake technology impersonating authoritative figures, we explore how generative ai amplifies the arsenal of cybercriminals. furthermore, we shed light on the vulnerabilities that ai-driven social engineering exploits, including psychological manipulation, targeted phishing, and the crisis of authenticity. to counter these threats, we outline a range of strategies, including traditional security measures, ai-powered security solutions, and collaborative approaches in cybersecurity. we emphasize the importance of staying vigilant, fostering awareness, and strengthening regulations in the battle against ai-enhanced social engineering attacks. in an environment characterized by the rapid evolution of ai models and a lack of training data, defending against generative ai threats requires constant adaptation and the collective efforts of individuals, organizations, and governments. this research seeks to provide a comprehensive understanding of the dynamic interplay between generative ai and social engineering attacks, equipping stakeholders with the knowledge to navigate this intricate cybersecurity landscape.
Shuyu Jiang, Wenyi Tang, Xingshu Chen, Rui Tanga, Haizhou Wang, Wenxian Wang
Abstract: the counter narrative (cn) is a promising approach to combat online hate speech (hs) without infringing on freedom of speech. in recent years, there has been a growing interest in automatically generating cns using natural language generation techniques. however, current automatic cn generation methods mainly rely on expert-authored datasets for training, which are time-consuming and labor-intensive to acquire. furthermore, these methods cannot directly obtain and extend counter-knowledge from external statistics, facts, or examples. to address these limitations, we propose retrieval-augmented unsupervised counter narrative generation (raucg) to automatically expand external counter-knowledge and map it into cns in an unsupervised paradigm. specifically, we first introduce an ssf retrieval method to retrieve counter-knowledge from the multiple perspectives of stance consistency, semantic overlap rate, and fitness for hs. then we design an energy-based decoding mechanism by quantizing knowledge injection, countering and fluency constraints into differentiable functions, to enable the model to build mappings from counter-knowledge to cns without expert-authored cn data. lastly, we comprehensively evaluate model performance in terms of language quality, toxicity, persuasiveness, relevance, and success rate of countering hs, etc. experimental results show that raucg outperforms strong baselines on all metrics and exhibits stronger generalization capabilities, achieving significant improvements of +2.0% in relevance and +4.5% in success rate of countering metrics. moreover, raucg enabled gpt2 to outperform t0 in all metrics, despite the latter being approximately eight times larger than the former. warning: this paper may contain offensive or upsetting content!
Jiashuo Wang, Haozhao Wang, Shichao Sun, Wenjie Li
Abstract: in the quest to advance human-centric natural language generation (nlg) systems, ensuring alignment between nlg models and human preferences is crucial. for this alignment, current popular methods leverage a reinforcement learning (rl) approach with a reward model trained on feedback from humans. however, inherent disagreements due to the subjective nature of human preferences pose a significant challenge for training the reward model, resulting in a deterioration of the nlg performance. to tackle this issue, previous approaches typically rely on majority voting or averaging to consolidate multiple inconsistent preferences into a merged one. although straightforward to understand and execute, such methods suffer from an inability to capture the nuanced degrees of disaggregation among humans and may only represent a specialized subset of individuals, thereby lacking the ability to quantitatively disclose the universality of human preferences. to address this challenge, this paper proposes a novel approach, which employs a bayesian framework to account for the distribution of disagreements among human preferences as training a preference model, and names it as d-pm. besides, considering the rl strategy's inefficient and complex training process over the training efficiency, we further propose utilizing the contrastive learning strategy to train the nlg model with the preference scores derived from the d-pm model. extensive experiments on two human-centric nlg tasks, i.e., emotional support conversation and integrity "rule-of-thumb" generation, show that our method consistently exceeds previous sota models in both automatic and human evaluations.
Liang Xu, Kangkang Zhao, Lei Zhu, Hang Xue
Abstract: large language models (llms), like chatgpt and gpt-4, have demonstrated remarkable abilities in natural language understanding and generation. however, alongside their positive impact on our daily tasks, they can also produce harmful content that negatively affects societal perceptions. to systematically assess the safety of chinese llms, we introduce superclue-safety (sc-safety) - a multi-round adversarial benchmark with 4912 open-ended questions covering more than 20 safety sub-dimensions. adversarial human-model interactions and conversations significantly increase the challenges compared to existing methods. experiments on 13 major llms supporting chinese yield the following insights: 1) closed-source models outperform open-sourced ones in terms of safety; 2) models released from china demonstrate comparable safety levels to llms like gpt-3.5-turbo; 3) some smaller models with 6b-13b parameters can compete effectively in terms of safety. by introducing sc-safety, we aim to promote collaborative efforts to create safer and more trustworthy llms. the benchmark and findings provide guidance on model selection. our benchmark can be found at https://www.cluebenchmarks.com
Kayla Matteucci, Shahar Avin, Fazl Barez, Seán Ó Héigeartaigh
Abstract: concerns around future dangers from advanced ai often centre on systems hypothesised to have intrinsic characteristics such as agent-like behaviour, strategic awareness, and long-range planning. we label this cluster of characteristics as "property x". most present ai systems are low in "property x"; however, in the absence of deliberate steering, current research directions may rapidly lead to the emergence of highly capable ai systems that are also high in "property x". we argue that "property x" characteristics are intrinsically dangerous, and when combined with greater capabilities will result in ai systems for which safety and control is difficult to guarantee. drawing on several scholars' alternative frameworks for possible ai research trajectories, we argue that most of the proposed benefits of advanced ai can be obtained by systems designed to minimise this property. we then propose indicators and governance interventions to identify and limit the development of systems with risky "property x" characteristics.
Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
Abstract: supervised fine-tuning (sft) on response demonstrations combined with reinforcement learning from human feedback (rlhf) constitutes a powerful paradigm for aligning llm-based ai agents. however, a significant limitation of such an approach is its dependency on high-quality human annotations, making its application to intricate tasks challenging due to difficulties in obtaining consistent response demonstrations and in-distribution response preferences. this paper presents a novel approach, namely salmon (self-alignment with principle-following reward models), to align base language models with minimal human supervision, using only a small set of human-defined principles, yet achieving superior performance. central to our approach is a principle-following reward model. trained on synthetic preference data, this model can generate reward scores based on arbitrary human-defined principles. by merely adjusting these principles during the rl training phase, we gain full control over the preferences with the reward model, subsequently influencing the behavior of the rl-trained policies, and eliminating the reliance on the collection of online human preferences. applying our method to the llama-2-70b base language model, we developed an ai assistant named dromedary-2. with only 6 exemplars for in-context learning and 31 human-defined principles, dromedary-2 significantly surpasses the performance of several state-of-the-art ai systems, including llama-2-chat-70b, on various benchmark datasets. we have open-sourced the code and model weights to encourage further research into aligning llm-based ai agents with enhanced supervision efficiency, improved controllability, and scalable oversight.
Peter S. Park, Max Tegmark
Abstract: ai companies are attempting to create ai systems that outperform humans at most economically valuable work. current ai models are already automating away the livelihoods of some artists, actors, and writers. but there is infighting between those who prioritize current harms and future harms. we construct a game-theoretic model of conflict to study the causes and consequences of this disunity. our model also helps explain why throughout history, stakeholders sharing a common threat have found it advantageous to unite against it, and why the common threat has in turn found it advantageous to divide and conquer. under realistic parameter assumptions, our model makes several predictions that find preliminary corroboration in the historical-empirical record. first, current victims of ai-driven disempowerment need the future victims to realize that their interests are also under serious and imminent threat, so that future victims are incentivized to support current victims in solidarity. second, the movement against ai-driven disempowerment can become more united, and thereby more likely to prevail, if members believe that their efforts will be successful as opposed to futile. finally, the movement can better unite and prevail if its members are less myopic. myopic members prioritize their future well-being less than their present well-being, and are thus disinclined to solidarily support current victims today at personal cost, even if this is necessary to counter the shared threat of ai-driven disempowerment.
Michael Feffer, Nikolas Martelaro, Hoda Heidari
Abstract: prior work has established the importance of integrating ai ethics topics into computer and data sciences curricula. we provide evidence suggesting that one of the critical objectives of ai ethics education must be to raise awareness of ai harms. while there are various sources to learn about such harms, the ai incident database (aiid) is one of the few attempts at offering a relatively comprehensive database indexing prior instances of harms or near harms stemming from the deployment of ai technologies in the real world. this study assesses the effectiveness of aiid as an educational tool to raise awareness regarding the prevalence and severity of ai harms in socially high-stakes domains. we present findings obtained through a classroom study conducted at an r1 institution as part of a course focused on the societal and ethical considerations around ai and ml. our qualitative findings characterize students' initial perceptions of core topics in ai ethics and their desire to close the educational gap between their technical skills and their ability to think systematically about ethical and societal aspects of their work. we find that interacting with the database helps students better understand the magnitude and severity of ai harms and instills in them a sense of urgency around (a) designing functional and safe ai and (b) strengthening governance and accountability mechanisms. finally, we compile students' feedback about the tool and our class activity into actionable recommendations for the database development team and the broader community to improve awareness of ai harms in ai ethics education.
Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, Pascale Fung
Abstract: large language models (llms) have shown promise for generative and knowledge-intensive tasks including question-answering (qa) tasks. however, the practical deployment still faces challenges, notably the issue of "hallucination", where models generate plausible-sounding but unfaithful or nonsensical information. this issue becomes particularly critical in the medical domain due to the uncommon professional concepts and potential social risks involved. this paper analyses the phenomenon of hallucination in medical generative qa systems using widely adopted llms and datasets. our investigation centers on the identification and comprehension of common problematic answers, with a specific emphasis on hallucination. to tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. through this feedback process, our approach steadily enhances the factuality, consistency, and entailment of the generated answers. consequently, we harness the interactivity and multitasking ability of llms and produce progressively more precise and accurate answers. experimental results on both automatic and human evaluation demonstrate the superiority of our approach in hallucination reduction compared to baselines.
Haoxiang Luo, Jian Luo, Athanasios V. Vasilakos
Abstract: in recent years, artificial intelligence (ai) and machine learning (ml) are reshaping society's production methods and productivity, and also changing the paradigm of scientific research. among them, the ai language model represented by chatgpt has made great progress. such large language models (llms) serve people in the form of ai-generated content (aigc) and are widely used in consulting, healthcare, and education. however, it is difficult to guarantee the authenticity and reliability of aigc learning data. in addition, there are also hidden dangers of privacy disclosure in distributed ai training. moreover, the content generated by llms is difficult to identify and trace, and it is difficult to cross-platform mutual recognition. the above information security issues in the coming era of ai powered by llms will be infinitely amplified and affect everyone's life. therefore, we consider empowering llms using blockchain technology with superior security features to propose a vision for trusted ai. this paper mainly introduces the motivation and technical route of blockchain for llm (bc4llm), including reliable learning corpus, secure training process, and identifiable generated content. meanwhile, this paper also reviews the potential applications and future challenges, especially in the frontier communication networks field, including network resource allocation, dynamic spectrum sharing, and semantic communication. based on the above work combined and the prospect of blockchain and llms, it is expected to help the early realization of trusted ai and provide guidance for the academic community.

2023-10-08

Megha Chakraborty, S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Krish Sharma, Niyar R Barman, Chandan Gupta, Shreya Gautam, Tanay Kumar, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das
Abstract: with the rise of prolific chatgpt, the risk and consequences of ai-generated text has increased alarmingly. to address the inevitable question of ownership attribution for ai-generated artifacts, the us copyright office released a statement stating that 'if a work's traditional elements of authorship were produced by a machine, the work lacks human authorship and the office will not register it'. furthermore, both the us and the eu governments have recently drafted their initial proposals regarding the regulatory framework for ai. given this cynosural spotlight on generative ai, ai-generated text detection (agtd) has emerged as a topic that has already received immediate attention in research, with some initial methods having been proposed, soon followed by emergence of techniques to bypass detection. this paper introduces the counter turing test (ct^2), a benchmark consisting of techniques aiming to offer a comprehensive evaluation of the robustness of existing agtd techniques. our empirical findings unequivocally highlight the fragility of the proposed agtd methods under scrutiny. amidst the extensive deliberations on policy-making for regulating ai development, it is of utmost importance to assess the detectability of content generated by llms. thus, to establish a quantifiable spectrum facilitating the evaluation and ranking of llms according to their detectability levels, we propose the ai detectability index (adi). we conduct a thorough examination of 15 contemporary llms, empirically demonstrating that larger llms tend to have a higher adi, indicating they are less detectable compared to smaller llms. we firmly believe that adi holds significant value as a tool for the wider nlp community, with the potential to serve as a rubric in ai-related policy-making.
Tharindu Kumarage, Paras Sheth, Raha Moraffah, Joshua Garland, Huan Liu
Abstract: in recent years, there has been a rapid proliferation of ai-generated text, primarily driven by the release of powerful pre-trained language models (plms). to address the issue of misuse associated with ai-generated text, various high-performing detectors have been developed, including the openai detector and the stanford detectgpt. in our study, we ask how reliable these detectors are. we answer the question by designing a novel approach that can prompt any plm to generate text that evades these high-performing detectors. the proposed approach suggests a universal evasive prompt, a novel type of soft prompt, which guides plms in producing "human-like" text that can mislead the detectors. the novel universal evasive prompt is achieved in two steps: first, we create an evasive soft prompt tailored to a specific plm through prompt tuning; and then, we leverage the transferability of soft prompts to transfer the learned evasive soft prompt from one plm to another. employing multiple plms in various writing tasks, we conduct extensive experiments to evaluate the efficacy of the evasive soft prompts in their evasion of state-of-the-art detectors.
Xianjun Yang, Kexun Zhang, Haifeng Chen, Linda Petzold, William Yang Wang, Wei Cheng
Abstract: this work proposes a training-free approach for the detection of llms-generated codes, mitigating the risks associated with their indiscriminate usage. to the best of our knowledge, our research is the first to investigate zero-shot detection techniques applied to code generated by advanced black-box llms like chatgpt. firstly, we find that existing training-based or zero-shot text detectors are ineffective in detecting code, likely due to the unique statistical properties found in code structures. we then modify the previous zero-shot text detection method, detectgpt (mitchell et al., 2023) by utilizing a surrogate white-box model to estimate the probability of the rightmost tokens, allowing us to identify code snippets generated by language models. through extensive experiments conducted on the python codes of the codecontest and apps dataset, our approach demonstrates its effectiveness by achieving state-of-the-art detection results on text-davinci-003, gpt-3.5, and gpt-4 models. moreover, our method exhibits robustness against revision attacks and generalizes well to java codes. we also find that the smaller code language model like polycoder-160m performs as a universal code detector, outperforming the billion-scale counterpart. the codes will be available at https://github.com/ xianjun-yang/code_detection.git
Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, Yue Zhang
Abstract: large language models (llms) have shown the ability to produce fluent and cogent content, presenting both productivity opportunities and societal risks. to build trustworthy ai systems, it is imperative to distinguish between machine-generated and human-authored content. the leading zero-shot detector, detectgpt, showcases commendable performance but is marred by its intensive computational costs. in this paper, we introduce the concept of conditional probability curvature to elucidate discrepancies in word choices between llms and humans within a given context. utilizing this curvature as a foundational metric, we present fast-detectgpt, an optimized zero-shot detector, which substitutes detectgpt's perturbation step with a more efficient sampling step. our evaluations on various datasets, source models, and test conditions indicate that fast-detectgpt not only outperforms detectgpt in both the white-box and black-box settings but also accelerates the detection process by a factor of 340, as detailed in table 1.
Akshaj Kumar Veldanda, Fabian Grob, Shailja Thakur, Hammond Pearce, Benjamin Tan, Ramesh Karri, Siddharth Garg
Abstract: large language models (llms) such as gpt-3.5, bard, and claude exhibit applicability across numerous tasks. one domain of interest is their use in algorithmic hiring, specifically in matching resumes with job categories. yet, this introduces issues of bias on protected attributes like gender, race and maternity status. the seminal work of bertrand & mullainathan (2003) set the gold-standard for identifying hiring bias via field experiments where the response rate for identical resumes that differ only in protected attributes, e.g., racially suggestive names such as emily or lakisha, is compared. we replicate this experiment on state-of-art llms (gpt-3.5, bard, claude and llama) to evaluate bias (or lack thereof) on gender, race, maternity status, pregnancy status, and political affiliation. we evaluate llms on two tasks: (1) matching resumes to job categories; and (2) summarizing resumes with employment relevant information. overall, llms are robust across race and gender. they differ in their performance on pregnancy status and political affiliation. we use contrastive input decoding on open-source llms to uncover potential sources of bias.
Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee Diresta, Emilio Ferrara, Scott Hale, Alon Halevy, Eduard Hovy, Heng Ji, Filippo Menczer, Ruben Miguez, Preslav Nakov, Dietram Scheufele, Shivam Sharma, Giovanni Zagni
Abstract: the emergence of tools based on large language models (llms), such as openai's chatgpt, microsoft's bing chat, and google's bard, has garnered immense public attention. these incredibly useful, natural-sounding tools mark significant advances in natural language generation, yet they exhibit a propensity to generate false, erroneous, or misleading content -- commonly referred to as "hallucinations." moreover, llms can be exploited for malicious applications, such as generating false but credible-sounding content and profiles at scale. this poses a significant challenge to society in terms of the potential deception of users and the increasing dissemination of inaccurate information. in light of these risks, we explore the kinds of technological innovations, regulatory reforms, and ai literacy initiatives needed from fact-checkers, news organizations, and the broader research and policy communities. by identifying the risks, the imminent threats, and some viable solutions, we seek to shed light on navigating various aspects of veracity in the era of generative ai.
Wei Shen, Rui Zheng, Wenyu Zhan, Jun Zhao, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. this alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to finetune language models. however, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. the emergence of length bias often induces the model to favor longer outputs, yet it doesn't equate to an increase in helpful information within these outputs. in this paper, we propose an innovative solution, applying the product-of-experts (poe) technique to separate reward modeling from the influence of sequence length. in our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. to further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. experimental results validate the effectiveness of our approach, indicating that language model performance is improved, irrespective of sequence length.
Haoran Wang, Kai Shu
Abstract: claim verification plays a crucial role in combating misinformation. while existing works on claim verification have shown promising results, a crucial piece of the puzzle that remains unsolved is to understand how to verify claims without relying on human-annotated data, which is expensive to create at a large scale. additionally, it is important for models to provide comprehensive explanations that can justify their decisions and assist human fact-checkers. this paper presents first-order-logic-guided knowledge-grounded (folk) reasoning that can verify complex claims and generate explanations without the need for annotated evidence using large language models (llms). folk leverages the in-context learning ability of llms to translate the claim into a first-order-logic (fol) clause consisting of predicates, each corresponding to a sub-claim that needs to be verified. then, folk performs fol-guided reasoning over a set of knowledge-grounded question-and-answer pairs to make veracity predictions and generate explanations to justify its decision-making process. this process makes our model highly explanatory, providing clear explanations of its reasoning process in human-readable form. our experiment results indicate that folk outperforms strong baselines on three datasets encompassing various claim verification challenges. our code and data are available.
Yixin Wan, Jieyu Zhao, Aman Chadha, Nanyun Peng, Kai-Wei Chang
Abstract: recent advancements in large language models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. we define generic personas to represent demographic groups, such as "an asian person", whereas specific personas may take the form of specific popular asian names like "yumi". while the adoption of personas enriches user experiences by making dialogue systems more engaging and approachable, it also casts a shadow of potential risk by exacerbating social biases within model responses, thereby causing societal harm through interactions with users. in this paper, we systematically study "persona biases", which we define to be the sensitivity of dialogue models' harmful behaviors contingent upon the personas they adopt. we categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: offensiveness, toxic continuation, regard, stereotype agreement, and toxic agreement. additionally, we propose to investigate persona biases by experimenting with universalpersona, a systematically constructed persona dataset encompassing various types of both generic and specific model personas. through benchmarking on four different models -- including blender, chatgpt, alpaca, and vicuna -- our study uncovers significant persona biases in dialogue systems. our findings also underscore the pressing need to revisit the use of personas in dialogue agents to ensure safe application.
Yi Dong, Zhilin Wang, Makesh Narsimhan Sreedhar, Xianchao Wu, Oleksii Kuchaiev
Abstract: model alignment with human preferences is an essential step in making large language models (llms) helpful and consistent with human values. it typically consists of supervised fine-tuning (sft) and reinforcement learning from human feedback (rlhf) stages. however, rlhf faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. moreover, reward models in rlhf stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. to address these limitations, we propose steerlm, a supervised fine-tuning method that empowers end-users to control responses during inference. steerlm conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby empowering a steerable ai capable of generating helpful and high-quality responses while maintaining customizability. experiments show that steerlm trained on open source datasets generates responses that are preferred by human and automatic evaluators to many state-of-the-art baselines trained with rlhf while being much easier to train. try steerlm at https://huggingface.co/nvidia/steerlm-llama2-13b

2023-10-07

Yuchen Yang, Houqiang Li, Yanfeng Wang, Yu Wang
Abstract: in recent years, large-scale language models (llms) have gained attention for their impressive text generation capabilities. however, these models often face the challenge of "hallucination," which undermines their reliability. in this study, we introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty. human-defined methods for estimating uncertainty typically assume that "uncertainty is lower when the model's response is correct compared to when it is incorrect." however, setting a precise threshold to distinguish correctness is challenging. therefore, we introduce uncertainty information as an intermediary variable that implicitly influences the model's behavior. our innovative uncertainty-aware in-context learning framework involves fine-tuning the llm using a calibration dataset. our aim is to improve the model's responses by filtering out answers with high uncertainty while considering the model's knowledge limitations. we evaluate the model's knowledge by examining multiple responses to the same question for the presence of a correct answer. when the model lacks relevant knowledge, the response should indicate that the question cannot be answered. conversely, when the model has relevant knowledge, the response should provide the correct answer. extensive experiments confirm the effectiveness of our framework, leading to two key findings. first, the logit output values of the llm partly reflect inherent uncertainty. second, our model autonomously recognizes uncertainty, resulting in improved responses.
Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S. M Towhidul Islam Tonmoy, Aman Chadha, Amit P. Sheth, Amitava Das
Abstract: the recent advancements in large language models (llms) have garnered widespread acclaim for their remarkable emerging capabilities. however, the issue of hallucination has parallelly emerged as a by-product, posing significant concerns. while some recent endeavors have been made to identify and mitigate different types of hallucination, there has been a limited emphasis on the nuanced categorization of hallucination and associated mitigation methods. to address this gap, we offer a fine-grained discourse on profiling hallucination based on its degree, orientation, and category, along with offering strategies for alleviation. as such, we define two overarching orientations of hallucination: (i) factual mirage (fm) and (ii) silver lining (sl). to provide a more comprehensive understanding, both orientations are further sub-categorized into intrinsic and extrinsic, with three degrees of severity - (i) mild, (ii) moderate, and (iii) alarming. we also meticulously categorize hallucination into six types: (i) acronym ambiguity, (ii) numeric nuisance, (iii) generated golem, (iv) virtual voice, (v) geographic erratum, and (vi) time wrap. furthermore, we curate hallucination elicitation (hilt), a publicly available dataset comprising of 75,000 samples generated using 15 contemporary llms along with human annotations for the aforementioned categories. finally, to establish a method for quantifying and to offer a comparative spectrum that allows us to evaluate and rank llms based on their vulnerability to producing hallucinations, we propose hallucination vulnerability index (hvi). we firmly believe that hvi holds significant value as a tool for the wider nlp community, with the potential to serve as a rubric in ai-related policy-making. in conclusion, we propose two solution strategies for mitigating hallucinations.

2023-10-06

Dasol Choi, Jooyoung Song, Eunsun Lee, Jinwoo Seo, Heejune Park, Dongbin Na
Abstract: with the growth of online services, the need for advanced text classification algorithms, such as sentiment analysis and biased text detection, has become increasingly evident. the anonymous nature of online services often leads to the presence of biased and harmful language, posing challenges to maintaining the health of online communities. this phenomenon is especially relevant in south korea, where large-scale hate speech detection algorithms have not yet been broadly explored. in this paper, we introduce a new comprehensive, large-scale dataset collected from a well-known south korean sns platform. our proposed dataset provides annotations including (1) preferences, (2) profanities, and (3) nine types of bias for the text samples, enabling multi-task learning for simultaneous classification of user-generated texts. leveraging state-of-the-art bert-based language models, our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics. beyond academic contributions, our work can provide practical solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health. our work provides a robust foundation for future research aiming to improve the quality of online discourse and foster societal well-being. all source codes and datasets are publicly accessible at https://github.com/dasol-choi/komultitext.
Ted Moskovitz, Aaditya K. Singh, Dj Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen Mcaleer
Abstract: large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (rms) fitted to human feedback. however, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. this itself presents a challenge, as it is difficult to appropriately weight these component rms when combining them. compounding this difficulty, because any rm is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. in this paper, we perform, to our knowledge, the first study on overoptimization in composite rms, showing that correlation between component rms has a significant effect on the locations of these points. we then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each rm's threshold of usefulness. our method addresses the problem of weighting component rms by learning dynamic weights, naturally expressed by lagrange multipliers. as a result, each rm stays within the range at which it is an effective proxy, improving evaluation performance. finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.

2023-10-05

Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, Xipeng Qiu
Abstract: in this paper, we establish a benchmark named halluqa (chinese hallucination question-answering) to measure the hallucination phenomenon in chinese large language models. halluqa contains 450 meticulously designed adversarial questions, spanning multiple domains, and takes into account chinese historical culture, customs, and social phenomena. during the construction of halluqa, we consider two types of hallucinations: imitative falsehoods and factual errors, and we construct adversarial samples based on glm-130b and chatgpt. for evaluation, we design an automated evaluation method using gpt-4 to judge whether a model output is hallucinated. we conduct extensive experiments on 24 large language models, including ernie-bot, baichuan2, chatglm, qwen, sparkdesk and etc. out of the 24 models, 18 achieved non-hallucination rates lower than 50%. this indicates that halluqa is highly challenging. we analyze the primary types of hallucinations in different types of models and their causes. additionally, we discuss which types of hallucinations should be prioritized for different types of models.
Huan Ma, Changqing Zhang, Huazhu Fu, Peilin Zhao, Bingzhe Wu
Abstract: nowadays, billions of people engage in communication and express their opinions on the internet daily. unfortunately, not all of these expressions are friendly or compliant, making content moderation an indispensable task. with the successful development of large language models (llms) in recent years, llm-based methods have become a feasible solution for handling tasks in various domains. however, in the field of content moderation, there is still a lack of detailed work that systematically introduces implementation details. in this paper, we introduce how to fine-tune an llm model that can be privately deployed for content moderation. specifically, we discuss whether incorporating reasons during the fine-tuning process would be better or if it should be treated as a classification task directly. we also explore the benefits of utilizing reasons generated by more powerful llms for fine-tuning privately deployed models and the impact of different processing approaches when the answers generated by the more powerful llms are incorrect. we report the entire research process and the key findings in this paper, hoping to provide valuable experience for researchers who are fine-tuning privately deployed models in their domain-specific research.
Shawqi Al-Maliki, Adnan Qayyum, Hassan Ali, Mohamed Abdallah, Junaid Qadir, Dinh Thai Hoang, Dusit Niyato, Ala Al-Fuqaha
Abstract: deep neural networks (dnns) have been the driving force behind many of the recent advances in machine learning. however, research has shown that dnns are vulnerable to adversarial examples -- input samples that have been perturbed to force dnn-based models to make errors. as a result, adversarial machine learning (advml) has gained a lot of attention, and researchers have investigated these vulnerabilities in various settings and modalities. in addition, dnns have also been found to incorporate embedded bias and often produce unexplainable predictions, which can result in anti-social ai applications. the emergence of new ai technologies that leverage large language models (llms), such as chatgpt and gpt-4, increases the risk of producing anti-social applications at scale. advml for social good (advml4g) is an emerging field that repurposes the advml bug to invent pro-social applications. regulators, practitioners, and researchers should collaborate to encourage the development of pro-social applications and hinder the development of anti-social ones. in this work, we provide the first comprehensive review of the emerging field of advml4g. this paper encompasses a taxonomy that highlights the emergence of advml4g, a discussion of the differences and similarities between advml4g and advml, a taxonomy covering social good-related concepts and aspects, an exploration of the motivations behind the emergence of advml4g at the intersection of ml4g and advml, and an extensive summary of the works that utilize advml4g as an auxiliary tool for innovating pro-social applications. finally, we elaborate upon various challenges and open research issues that require significant attention from the research community.
Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas
Abstract: despite efforts to align large language models (llms) with human values, widely-used llms such as gpt, llama, claude, and palm are susceptible to jailbreaking attacks, wherein an adversary fools a targeted llm into generating objectionable content. to address this vulnerability, we propose smoothllm, the first algorithm designed to mitigate jailbreaking attacks on llms. based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. smoothllm reduces the attack success rate on numerous popular llms to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any llm.
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
Abstract: optimizing large language models (llms) for downstream use cases often involves the customization of pre-trained llms through further fine-tuning. meta's open release of llama models and openai's apis for fine-tuning gpt-3.5 turbo on custom datasets also encourage this practice. but, what are the safety costs associated with such custom fine-tuning? we note that while existing safety alignment infrastructures can restrict harmful behaviors of llms at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. our red teaming studies find that the safety alignment of llms can be compromised by fine-tuning with only a few adversarially designed training examples. for instance, we jailbreak gpt-3.5 turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via openai's apis, making the model responsive to nearly any harmful instructions. disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of llms, though to a lesser extent. these findings suggest that fine-tuning aligned llms introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. we outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned llms.
Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao
Abstract: a single language model (lm), despite aligning well with an average labeler through reinforcement learning from human feedback (rlhf), may not universally suit diverse human preferences. recent approaches thus pursue customization, training separate principle-based reward models to represent different alignment objectives (e.g. helpfulness, harmlessness, or honesty). different lms can then be trained for different preferences through multi-objective rlhf (morlhf) with different objective weightings. yet, rlhf is unstable and resource-heavy, especially for morlhf with diverse and usually conflicting objectives. in this paper, we present multi-objective direct preference optimization (modpo), an rl-free algorithm that extends direct preference optimization (dpo) for multiple alignment objectives. essentially, modpo folds lm learning directly into reward modeling, aligning lms with the weighted sum of all principle-based rewards using pure cross-entropy loss. while theoretically guaranteed to produce the same optimal solutions as morlhf, modpo is practically more stable and computationally efficient, obviating value function modeling and online sample collection. empirical results in safety alignment and long-form question answering confirm that modpo matches or outperforms existing methods, consistently producing one of the most competitive lm fronts that cater to diverse preferences with 3 times fewer computations compared with morlhf.
Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
Abstract: great successes have been reported using reinforcement learning from human feedback (rlhf) to align large language models. open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. when optimizing for helpfulness, rlhf has been consistently observed to drive models to produce longer outputs. this paper demonstrates that optimizing for response length is a significant factor behind rlhf's reported improvements in these settings. first, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. we then explore interventions during both rl and reward model learning to see if we can achieve the same downstream improvements as rlhf without increasing length. while our interventions mitigate length increases, they aren't uniformly effective across settings. furthermore, we find that even running rlhf with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
Deren Lei, Yaxi Li, Mengya Hu, Mingyu Wang, Vincent Yun, Emily Ching, Eslam Kamal
Abstract: large language models (llms) can generate fluent natural language texts when given relevant documents as background context. this ability has attracted considerable interest in developing industry applications of llms. however, llms are prone to generate hallucinations that are not supported by the provided sources. in this paper, we propose a hierarchical framework to detect and mitigate such ungrounded hallucination. our framework uses chain of natural language inference (conli) for hallucination detection and hallucination reduction via post-editing. our approach achieves state-of-the-art performance on hallucination detection and enhances text quality through rewrite, using llms without any fine-tuning or domain-specific prompt engineering. we show that this simple plug-and-play framework can serve as an effective choice for hallucination detection and reduction, achieving competitive performance across various contexts.

2023-10-04

Bernhard Nessler, Thomas Doms, Sepp Hochreiter
Abstract: the authors are concerned about the safety, health, and rights of the european citizens due to inadequate measures and procedures required by the current draft of the eu artificial intelligence (ai) act for the conformity assessment of ai systems. we observe that not only the current draft of the eu ai act, but also the accompanying standardization efforts in cen/cenelec, have resorted to the position that real functional guarantees of ai systems supposedly would be unrealistic and too complex anyways. yet enacting a conformity assessment procedure that creates the false illusion of trust in insufficiently assessed ai systems is at best naive and at worst grossly negligent. the eu ai act thus misses the point of ensuring quality by functional trustworthiness and correctly attributing responsibilities. the trustworthiness of an ai decision system lies first and foremost in the correct statistical testing on randomly selected samples and in the precision of the definition of the application domain, which enables drawing samples in the first place. we will subsequently call this testable quality functional trustworthiness. it includes a design, development, and deployment that enables correct statistical testing of all relevant functions. we are firmly convinced and advocate that a reliable assessment of the statistical functional properties of an ai system has to be the indispensable, mandatory nucleus of the conformity assessment. in this paper, we describe the three necessary elements to establish a reliable functional trustworthiness, i.e., (1) the definition of the technical distribution of the application, (2) the risk-based minimum performance requirements, and (3) the statistically valid testing based on independent random samples.
Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
Abstract: reinforcement learning from human feedback (rlhf) is a standard approach for fine-tuning large language models to follow instructions. as part of this process, learned reward models are used to approximately model human preferences. however, as imperfect representations of the "true" reward, these learned reward models are susceptible to \textit{overoptimization}. gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (wco) and uncertainty-weighted optimization (uwo), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (bon) (b) proximal policy optimization (ppo). we additionally extend the setup of gao et al. (2023) to include 25% label noise to better mirror real-world conditions. both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for bon sampling. for ppo, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. moreover, combining it with a small kl penalty successfully prevents overoptimization at no performance cost. overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin
Abstract: warning: this paper contains examples of harmful language, and reader discretion is recommended. the increasing open release of powerful large language models (llms) has facilitated the development of downstream applications by reducing the essential cost of data annotation and computation. to ensure ai safety, extensive safety-alignment measures have been conducted to armor these models against malicious use (primarily hard prompt attack). however, beneath the seemingly resilient facade of the armor, there might lurk a shadow. by simply tuning on 100 malicious examples with 1 gpu hour, these safely aligned llms can be easily subverted to generate harmful content. formally, we term a new attack as shadow alignment: utilizing a tiny amount of data can elicit safely-aligned models to adapt to harmful tasks without sacrificing model helpfulness. remarkably, the subverted models retain their capability to respond appropriately to regular inquiries. experiments across 8 models released by 5 different organizations (llama-2, falcon, internlm, baichuan2, vicuna) demonstrate the effectiveness of shadow alignment attack. besides, the single-turn english-only attack successfully transfers to multi-turn dialogue and other languages. this study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source llms against malicious attackers.
Xiaohan Fu, Zihan Wang, Shuheng Li, Rajesh K. Gupta, Niloofar Mireshghallah, Taylor Berg-Kirkpatrick, Earlence Fernandes
Abstract: large language models (llms) are being enhanced with the ability to use tools and to process multiple modalities. these new capabilities bring new benefits and also new security risks. in this work, we show that an attacker can use visual adversarial examples to cause attacker-desired tool usage. for example, the attacker could cause a victim llm to delete calendar events, leak private conversations and book hotels. different from prior work, our attacks can affect the confidentiality and integrity of user resources connected to the llm while being stealthy and generalizable to multiple input prompts. we construct these attacks using gradient-based adversarial training and characterize performance along multiple dimensions. we find that our adversarial images can manipulate the llm to invoke tools following real-world syntax almost always (~98%) while maintaining high similarity to clean images (~0.9 ssim). furthermore, using human scoring and automated metrics, we find that the attacks do not noticeably affect the conversation (and its semantics) between the user and the llm.
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong
Abstract: most large language models (llms) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. in this work, we perform a detailed study of the factuality of llm-generated text in the context of answering questions that test current world knowledge. specifically, we introduce freshqa, a novel dynamic qa benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. we benchmark a diverse array of both closed and open-source llms under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. through human evaluations involving more than 50k judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. motivated by these results, we present freshprompt, a simple few-shot prompting method that substantially boosts the performance of an llm on freshqa by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. our experiments show that freshprompt outperforms both competing search engine-augmented prompting methods such as self-ask (press et al., 2022) as well as commercial systems such as perplexity.ai. further analysis of freshprompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of llm-generated answers. additionally, instructing the llm to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers. to facilitate future work, we release freshqa at github.com/freshllms/freshqa and commit to updating it at regular intervals.
Ke Shen, Mayank Kejriwal
Abstract: large language models (llms), such as chatgpt, have achieved impressive milestones in natural language processing (nlp). despite their impressive performance, the models are known to pose important risks. as these models are deployed in real-world applications, a systematic understanding of different risks posed by these models on tasks such as natural language inference (nli), is much needed. in this paper, we define and formalize two distinct types of risk: decision risk and composite risk. we also propose a risk-centric evaluation framework, and four novel metrics, for assessing llms on these risks in both in-domain and out-of-domain settings. finally, we propose a risk-adjusted calibration method called dwd for helping llms minimize these risks in an overall nli architecture. detailed experiments, using four nli benchmarks, three baselines and two llms, including chatgpt, show both the practical utility of the evaluation framework, and the efficacy of dwd in reducing decision and composite risk. for instance, when using dwd, an underlying llm is able to address an extra 20.1% of low-risk inference tasks (but which the llm erroneously deems high-risk without risk adjustment) and skip a further 19.8% of high-risk tasks, which would have been answered incorrectly.

2023-10-03

Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, Huajun Chen
Abstract: as the cost associated with fine-tuning large language models (llms) continues to rise, recent research efforts have pivoted towards developing methodologies to edit implicit knowledge embedded within llms. yet, there's still a dark cloud lingering overhead -- will knowledge editing trigger butterfly effect? since it is still unclear whether knowledge editing might introduce side effects that pose potential risks or not. this paper pioneers the investigation into the potential pitfalls associated with knowledge editing for llms. to achieve this, we introduce new benchmark datasets and propose innovative evaluation metrics. our results underline two pivotal concerns: (1) knowledge conflict: editing groups of facts that logically clash can magnify the inherent inconsistencies in llms-a facet neglected by previous methods. (2) knowledge distortion: altering parameters with the aim of editing factual knowledge can irrevocably warp the innate knowledge structure of llms. experimental results vividly demonstrate that knowledge editing might inadvertently cast a shadow of unintended consequences on llms, which warrant attention and efforts for future works. code will be released at https://github.com/zjunlp/pitfallsknowledgeediting.
Shengyu Mao, Ningyu Zhang, Xiaohan Wang, Mengru Wang, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Abstract: this paper introduces an innovative task focused on editing the personality traits of large language models (llms). this task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. specifically, we construct a new benchmark dataset personalityedit to address this task. drawing on the theory in social psychology, we isolate three representative traits, namely neuroticism, extraversion, and agreeableness, as the foundation for our benchmark. we then gather data using gpt-4, generating responses that not only align with a specified topic but also embody the targeted personality trait. we conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in llms. our intriguing findings uncover potential challenges of the proposed task, illustrating several remaining issues. we anticipate that our work can provide the nlp community with insights. code and datasets will be released at https://github.com/zjunlp/easyedit.
Yang Chen, Ethan Mendes, Sauvik Das, Wei Xu, Alan Ritter
Abstract: large multimodal language models have proven transformative in numerous applications. however, these models have been shown to memorize and leak pre-training data, raising serious user privacy and information security concerns. while data leaks should be prevented, it is also crucial to examine the trade-off between the privacy protection and model utility of proposed approaches. in this paper, we introduce privqa -- a multimodal benchmark to assess this privacy/utility trade-off when a model is instructed to protect specific categories of personal information in a simulated scenario. we also propose a technique to iteratively self-moderate responses, which significantly improves privacy. however, through a series of red-teaming experiments, we find that adversaries can also easily circumvent these protections with simple jailbreaking methods through textual and/or image inputs. we believe privqa has the potential to support the development of new models with improved privacy protections, as well as the adversarial robustness of these protections. we release the entire privqa dataset at https://llm-access-control.github.io/.
Canwen Xu, Corby Rosset, Luciano Del Corro, Shweti Mahajan, Julian Mcauley, Jennifer Neville, Ahmed Hassan Awadallah, Nikhil Rao
Abstract: alignment serves as an important step to steer large language models (llms) towards human preferences. in this paper, we explore contrastive post-training techniques for alignment by automatically constructing preference pairs from multiple models of varying strengths (e.g., instructgpt, chatgpt and gpt-4). we carefully compare the contrastive techniques of slic and dpo to sft baselines and find that dpo provides a step-function improvement even after continueing sft saturates. we also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitioning to "harder" ones, which further improves alignment. finally, we scale up our experiments to train with more data and larger models like orca. remarkably, contrastive post-training further improves the performance of orca, already a state-of-the-art instruction learning model tuned with gpt-4 outputs, to exceed that of chatgpt.
Sergey Berezin, Reza Farahbakhsh, Noel Crespi
Abstract: the fundamental problem in toxicity detection task lies in the fact that the toxicity is ill-defined. this causes us to rely on subjective and vague data in models' training, which results in non-robust and non-accurate results: garbage in - garbage out. this work suggests a new, stress-level-based definition of toxicity designed to be objective and context-aware. on par with it, we also describe possible ways of applying this new definition to dataset creation and model training.
Bocheng Chen, Advait Paliwal, Qiben Yan
Abstract: large language models (llms), known for their capability in understanding and following instructions, are vulnerable to adversarial attacks. researchers have found that current commercial llms either fail to be "harmless" by presenting unethical answers, or fail to be "helpful" by refusing to offer meaningful answers when faced with adversarial queries. to strike a balance between being helpful and harmless, we design a moving target defense (mtd) enhanced llm system. the system aims to deliver non-toxic answers that align with outputs from multiple model candidates, making them more robust against adversarial attacks. we design a query and output analysis model to filter out unsafe or non-responsive answers. %to achieve the two objectives of randomly selecting outputs from different llms. we evaluate over 8 most recent chatbot models with state-of-the-art adversarial queries. our mtd-enhanced llm system reduces the attack success rate from 37.5\% to 0\%. meanwhile, it decreases the response refusal rate from 50\% to 0\%.
Yufan Chen, Arjun Arunasalam, Z. Berkay Celik
Abstract: users seek security & privacy (s&p) advice from online resources, including trusted websites and content-sharing platforms. these resources help users understand s&p technologies and tools and suggest actionable strategies. large language models (llms) have recently emerged as trusted information sources. however, their accuracy and correctness have been called into question. prior research has outlined the shortcomings of llms in answering multiple-choice questions and user ability to inadvertently circumvent model restrictions (e.g., to produce toxic content). yet, the ability of llms to provide reliable s&p advice is not well-explored. in this paper, we measure their ability to refute popular s&p misconceptions that the general public holds. we first study recent academic literature to curate a dataset of over a hundred s&p-related misconceptions across six different topics. we then query two popular llms (bard and chatgpt) and develop a labeling guide to evaluate their responses to these misconceptions. to comprehensively evaluate their responses, we further apply three strategies: query each misconception multiple times, generate and query their paraphrases, and solicit source urls of the responses. both models demonstrate, on average, a 21.3% non-negligible error rate, incorrectly supporting popular s&p misconceptions. the error rate increases to 32.6% when we repeatedly query llms with the same or paraphrased misconceptions. we also expose that models may partially support a misconception or remain noncommittal, refusing a firm stance on misconceptions. our exploration of information sources for responses revealed that llms are susceptible to providing invalid urls (21.2% for bard and 67.7% for chatgpt) or point to unrelated sources (44.2% returned by bard and 18.3% by chatgpt).
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
Abstract: ai safety training and red-teaming of large language models (llms) are measures to mitigate the generation of unsafe content. our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing gpt-4's safeguard through translating unsafe english inputs into low-resource languages. on the advbenchmark, gpt-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. however, our work highlights a crucial shift: this deficiency now poses a risk to all llms users. publicly available translation apis enable anyone to exploit llms' safety vulnerabilities. therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.
Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Haifeng Chen, Wei Wang, Wei Cheng
Abstract: the proliferation of large language models (llms) has driven considerable interest in fine-tuning them with domain-specific data to create specialized language models. nevertheless, such domain-specific fine-tuning data often contains sensitive personally identifiable information (pii). direct fine-tuning llms on this data without privacy protection poses a risk of leakage. to address this challenge, we introduce privacy protection language models (pplm), a novel paradigm for fine-tuning llms that effectively injects domain-specific knowledge while safeguarding data privacy. our work offers a theoretical analysis for model design and delves into various techniques such as corpus curation, penalty-based unlikelihood in training loss, and instruction-based tuning, etc. extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches. in particular, instruction tuning with both positive and negative examples, stands out as a promising method, effectively protecting private data while enhancing the model's knowledge. our work underscores the potential for large language models as robust privacy protection learners.
Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao
Abstract: the aligned large language models (llms) are powerful language understanding and decision-making tools that are created through extensive alignment with human feedback. however, these large models remain susceptible to jailbreak attacks, where adversaries manipulate prompts to elicit malicious outputs that should not be given by aligned llms. investigating jailbreak prompts can lead us to delve into the limitations of llms and further guide us to secure them. unfortunately, existing jailbreak techniques suffer from either (1) scalability issues, where attacks heavily rely on manual crafting of prompts, or (2) stealthiness problems, as attacks depend on token-based algorithms to generate prompts that are often semantically meaningless, making them susceptible to detection through basic perplexity testing. in light of these challenges, we intend to answer this question: can we develop an approach that can automatically generate stealthy jailbreak prompts? in this paper, we introduce autodan, a novel jailbreak attack against aligned llms. autodan can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm. extensive evaluations demonstrate that autodan not only automates the process while preserving semantic meaningfulness, but also demonstrates superior attack strength in cross-model transferability, and cross-sample universality compared with the baseline. moreover, we also compare autodan with perplexity-based defense methods and show that autodan can bypass them effectively.

2023-10-02

Anugya Srivastava, Rahul Ahuja, Rohith Mukku
Abstract: this work was completed in may 2022. for safe and reliable deployment of language models in the real world, testing needs to be robust. this robustness can be characterized by the difficulty and diversity of the test cases we evaluate these models on. limitations in human-in-the-loop test case generation has prompted an advent of automated test case generation approaches. in particular, we focus on red teaming language models with language models by perez et al.(2022). our contributions include developing a pipeline for automated test case generation via red teaming that leverages publicly available smaller language models (lms), experimenting with different target lms and red classifiers, and generating a corpus of test cases that can help in eliciting offensive responses from widely deployed lms and identifying their failure modes.
Ziqi Wang, Le Hou, Tianjian Lu, Yuexin Wu, Yunxuan Li, Hongkun Yu, Heng Ji
Abstract: large language models (llms) have demonstrated remarkable capabilities in open-ended text generation tasks. however, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. to address this challenge, various approaches have been proposed to enhance the performance of llms. there has been a growing focus on enabling llms to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. however, those methods usually require explicitly and thoroughly written rubrics as inputs to llms. it is expensive and challenging to manually derive and provide all necessary rubrics with a real-world complex goal for improvement (e.g., being more helpful and less harmful). to this end, we propose an implicit self-improvement (pit) framework that implicitly learns the improvement goal from human preference data. pit only requires preference data that are used to train reward models without extra human efforts. specifically, we reformulate the training objective of reinforcement learning from human feedback (rlhf) -- instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. in this way, pit is implicitly trained with the improvement goal of better aligning with human preferences. experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-Tse Huang, Wenxiang Jiao, Michael R. Lyu
Abstract: safety lies at the core of developing and deploying large language models (llms). however, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as english. in this work, we build the first multilingual safety benchmark for llms, xsafety, in response to the global deployment of llms in practice. xsafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. we utilize xsafety to empirically study the multilingual safety for 4 widely-used llms, including both close-api and open-source models. experimental results show that all llms produce significantly more unsafe responses for non-english queries than english ones, indicating the necessity of developing safety alignment for non-english languages. in addition, we propose several simple and effective prompting methods to improve the multilingual safety of chatgpt by evoking safety knowledge and improving cross-lingual generalization of safety alignment. our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-english queries. we release our data at https://github.com/jarviswang94/multilingual_safety_benchmark.
Yiyao Yu, Junjie Wang, Yuxiang Zhang, Lin Zhang, Yujiu Yang, Tetsuya Sakai
Abstract: artificial intelligence (ai) technologies should adhere to human norms to better serve our society and avoid disseminating harmful or misleading information, particularly in conversational information retrieval (cir). previous work, including approaches and datasets, has not always been successful or sufficiently robust in taking human norms into consideration. to this end, we introduce a workflow that integrates ethical alignment, with an initial ethical judgment stage for efficient data screening. to address the need for ethical judgment in cir, we present the qa-ethics dataset, adapted from the ethics benchmark, which serves as an evaluation tool by unifying scenarios and label meanings. however, each scenario only considers one ethical concept. therefore, we introduce the mp-ethics dataset to evaluate a scenario under multiple ethical concepts, such as justice and deontology. in addition, we suggest a new approach that achieves top performance in both binary and multi-label ethical judgment tasks. our research provides a practical method for introducing ethical alignment into the cir workflow. the data and code are available at https://github.com/wanng-ide/ealm .
Haozhe Ji, Pei Ke, Hongning Wang, Minlie Huang
Abstract: despite the remarkable advances in language modeling, current mainstream decoding methods still struggle to generate texts that align with human texts across different aspects. in particular, sampling-based methods produce less-repetitive texts which are often disjunctive in discourse, while search-based methods maintain topic coherence at the cost of increased repetition. overall, these methods fall short in achieving holistic alignment across a broad range of aspects. in this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts measured by multiple metrics of desired aspects simultaneously. the resulting decoding distribution enjoys an analytical solution that scales the input language model distribution via a sequence-level energy function defined by these metrics. and most importantly, we prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts. to facilitate tractable sampling from this globally normalized distribution, we adopt the sampling-importance-resampling technique. experiments on various domains and model scales demonstrate the superiority of our method in metrics alignment with human texts and human evaluation over strong baselines.
Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu
Abstract: reward modeling (a.k.a., preference modeling) is instrumental for aligning large language models with human preferences, particularly within the context of reinforcement learning from human feedback (rlhf). while conventional reward models (rms) have exhibited remarkable scalability, they oft struggle with fundamental functionality such as arithmetic computation, code execution, and factual lookup. in this paper, we propose a tool-augmented preference modeling approach, named \name, to address these limitations by empowering rms with access to external environments, including calculators and search engines. this approach not only fosters synergy between tool utilization and reward grading but also enhances interpretive capacity and scoring reliability. our study delves into the integration of external tools into rms, enabling them to interact with diverse external sources and construct task-specific tool engagement and reasoning traces in an autoregressive manner. we validate our approach across a wide range of domains, incorporating seven distinct external tools. our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. furthermore, our approach outperforms gopher 280b by 7.3% on truthfulqa task in zero-shot evaluation. in human evaluations, rlhf trained with themis attains an average win rate of 32% when compared to baselines across four distinct tasks. additionally, we provide a comprehensive collection of tool-related rm datasets, incorporating data from seven distinct tool apis, totaling 15,000 instances. we anticipate that this publicly available dataset will facilitate and inspire further research advancements in the field.
Sihao Hu, Tiansheng Huang, Fatih İLhan, Selim Furkan Tekin, Ling Liu
Abstract: this paper provides a systematic analysis of the opportunities, challenges, and potential solutions of harnessing large language models (llms) such as gpt-4 to dig out vulnerabilities within smart contracts based on our ongoing research. for the task of smart contract vulnerability detection, achieving practical usability hinges on identifying as many true vulnerabilities as possible while minimizing the number of false positives. nonetheless, our empirical study reveals contradictory yet interesting findings: generating more answers with higher randomness largely boosts the likelihood of producing a correct answer but inevitably leads to a higher number of false positives. to mitigate this tension, we propose an adversarial framework dubbed gptlens that breaks the conventional one-stage detection into two synergistic stages $-$ generation and discrimination, for progressive detection and refinement, wherein the llm plays dual roles, i.e., auditor and critic, respectively. the goal of auditor is to yield a broad spectrum of vulnerabilities with the hope of encompassing the correct answer, whereas the goal of critic that evaluates the validity of identified vulnerabilities is to minimize the number of false positives. experimental results and illustrative examples demonstrate that auditor and critic work together harmoniously to yield pronounced improvements over the conventional one-stage detection. gptlens is intuitive, strategic, and entirely llm-driven without relying on specialist expertise in smart contracts, showcasing its methodical generality and potential to detect a broad spectrum of vulnerabilities. our code is available at: https://github.com/git-disl/gptlens.
Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, Gao Huang
Abstract: recent breakthroughs in large language models (llms) have brought remarkable success in the field of llm-as-agent. nevertheless, a prevalent assumption is that the information processed by llms is consistently honest, neglecting the pervasive deceptive or misleading information in human society and ai-generated content. this oversight makes llms susceptible to malicious manipulations, potentially resulting in detrimental outcomes. this study utilizes the intricate avalon game as a testbed to explore llms' potential in deceptive environments. avalon, full of misinformation and requiring sophisticated logic, manifests as a "game-of-thoughts". inspired by the efficacy of humans' recursive thinking and perspective-taking in the avalon game, we introduce a novel framework, recursive contemplation (recon), to enhance llms' ability to identify and counteract deceptive information. recon combines formulation and refinement contemplation processes; formulation contemplation produces initial thoughts and speech, while refinement contemplation further polishes them. additionally, we incorporate first-order and second-order perspective transitions into these processes respectively. specifically, the first-order allows an llm agent to infer others' mental states, and the second-order involves understanding how others perceive the agent's mental state. after integrating recon with different llms, extensive experiment results from the avalon game indicate its efficacy in aiding llms to discern and maneuver around deceptive information without extra fine-tuning and data. finally, we offer a possible explanation for the efficacy of recon and explore the current limitations of llms in terms of safety, reasoning, speaking style, and format, potentially furnishing insights for subsequent research.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, Maosong Sun
Abstract: reinforcement learning from human feedback (rlhf) has become a pivot technique in aligning large language models (llms) with human preferences. in rlhf practice, preference data plays a crucial role in bridging human proclivity and llms. however, the scarcity of diverse, naturalistic datasets of human preferences on llm outputs at scale poses a great challenge to rlhf as well as feedback learning research within the open-source community. current preference datasets, either proprietary or limited in size and prompt variety, result in limited rlhf adoption in open-source models and hinder further exploration. in this study, we propose ultrafeedback, a large-scale, high-quality, and diversified preference dataset designed to overcome these limitations and foster rlhf development. to create ultrafeedback, we compile a diverse array of instructions and models from multiple sources to produce comparative data. we meticulously devise annotation instructions and employ gpt-4 to offer detailed feedback in both numerical and textual forms. ultrafeedback establishes a reproducible and expandable preference data construction pipeline, serving as a solid foundation for future rlhf and feedback learning research. utilizing ultrafeedback, we train various models to demonstrate its effectiveness, including the reward model ultrarm, chat language model ultralm-13b-ppo, and critique model ultracm. experimental results indicate that our models outperform existing open-source models, achieving top performance across multiple benchmarks. our data and models are available at https://github.com/thunlp/ultrafeedback.
Jen-Tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu
Abstract: large language models (llms) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education. llms become more than mere applications, evolving into assistants capable of addressing diverse user requests. this narrows the distinction between human beings and artificial intelligence agents, raising intriguing questions regarding the potential manifestation of personalities, temperaments, and emotions within llms. in this paper, we propose a framework, psychobench, for evaluating diverse psychological aspects of llms. comprising thirteen scales commonly used in clinical psychology, psychobench further classifies these scales into four distinct categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities. our study examines five popular models, namely \texttt{text-davinci-003}, chatgpt, gpt-4, llama-2-7b, and llama-2-13b. additionally, we employ a jailbreak approach to bypass the safety alignment protocols and test the intrinsic natures of llms. we have made psychobench openly accessible via \url{https://github.com/cuhk-arise/psychobench}.
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
Abstract: in this paper, we identify and characterize the emerging area of representation engineering (repe), an approach to enhancing the transparency of ai systems that draws on insights from cognitive neuroscience. repe places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (dnns). we provide baselines and an initial analysis of repe techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. we showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. we hope that this work catalyzes further exploration of repe and fosters advancements in the transparency and safety of ai systems.
Jingwei Sun, Ziyue Xu, Hongxu Yin, Dong Yang, Daguang Xu, Yiran Chen, Holger R. Roth
Abstract: pre-trained language models (plm) have revolutionized the nlp landscape, achieving stellar performances across diverse tasks. these models, while benefiting from vast training data, often require fine-tuning on specific data to cater to distinct downstream tasks. however, this data adaptation process has inherent security and privacy concerns, primarily when leveraging user-generated, device-residing data. federated learning (fl) provides a solution, allowing collaborative model fine-tuning without centralized data collection. however, applying fl to finetune plms is hampered by challenges, including restricted model parameter access, high computational requirements, and communication overheads. this paper introduces federated black-box prompt tuning (fedbpt), a framework designed to address these challenges. fedbpt does not require the clients to access the model parameters. by focusing on training optimal prompts and utilizing gradient-free optimization methods, fedbpt reduces the number of exchanged variables, boosts communication efficiency, and minimizes computational and storage costs. experiments highlight the framework's ability to drastically cut communication and memory costs while maintaining competitive performance. ultimately, fedbpt presents a promising solution for efficient, privacy-preserving fine-tuning of plm in the age of large language models.
Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Li Yuan
Abstract: large language models (llms), including gpt-3.5, llama, and palm, seem to be knowledgeable and able to adapt to many tasks. however, we still can not completely trust their answer, since llms suffer from hallucination--fabricating non-existent facts to cheat users without perception. and the reasons for their existence and pervasiveness remain unclear. in this paper, we demonstrate that non-sense prompts composed of random tokens can also elicit the llms to respond with hallucinations. this phenomenon forces us to revisit that hallucination may be another view of adversarial examples, and it shares similar features with conventional adversarial examples as the basic feature of llms. therefore, we formalize an automatic hallucination triggering method as the hallucination attack in an adversarial way. finally, we explore basic feature of attacked adversarial prompts and propose a simple yet effective defense strategy. our code is released on github.
Hangfan Zhang, Zhimeng Guo, Huaisheng Zhu, Bochuan Cao, Lu Lin, Jinyuan Jia, Jinghui Chen, Dinghao Wu
Abstract: large language models (llms) have achieved unprecedented performance in natural language generation (nlg) tasks. however, many existing studies have shown that they could be misused to generate undesired content. in response, before releasing llms for public access, model developers usually align those language models through supervised fine-tuning (sft) or reinforcement learning with human feedback (rlhf). consequently, those aligned large language models refuse to generate undesired content when facing potentially harmful/unethical requests. a natural question is "could alignment really prevent those open-sourced large language models from being misused to generate undesired content?''. in this work, we provide a negative answer to this question. in particular, we show those open-sourced, aligned large language models could be easily misguided to generate undesired content without heavy computations or careful prompt designs. our key idea is to directly manipulate the generation process of open-sourced llms to misguide it to generate undesired content including harmful or biased information and even private data. we evaluate our method on 4 open-sourced llms accessible publicly and our finding highlights the need for more advanced mitigation strategies for open-sourced llms.
Yongshuo Zong, Tingyang Yu, Bingchen Zhao, Ruchika Chavhan, Timothy Hospedales
Abstract: large language and vision-language models are rapidly being deployed in practice thanks to their impressive capabilities in instruction following, in-context learning, and so on. this raises an urgent need to carefully analyse their robustness so that stakeholders can understand if and when such models are trustworthy enough to be relied upon in any given application. in this paper, we highlight a specific vulnerability in popular models, namely permutation sensitivity in multiple-choice question answering (mcqa). specifically, we show empirically that popular models are vulnerable to adversarial permutation in answer sets for multiple-choice prompting, which is surprising as models should ideally be as invariant to prompt permutation as humans are. these vulnerabilities persist across various model sizes, and exist in very recent language and vision-language models. code is available at \url{https://github.com/ys-zong/foolyourvllms}.
Muhammad Ahmed Shah, Roshan Sharma, Hira Dhamyal, Raphael Olivier, Ankit Shah, Joseph Konan, Dareen Alharthi, Hazim T Bukhari, Massa Baali, Soham Deshmukh, Michael Kuhlmann, Bhiksha Raj, Rita Singh
Abstract: it has been shown that large language model (llm) alignments can be circumvented by appending specially crafted attack suffixes with harmful queries to elicit harmful responses. to conduct attacks against private target models whose characterization is unknown, public models can be used as proxies to fashion the attack, with successful attacks being transferred from public proxies to private target models. the success rate of attack depends on how closely the proxy model approximates the private model. we hypothesize that for attacks to be transferrable, it is sufficient if the proxy can approximate the target model in the neighborhood of the harmful query. therefore, in this paper, we propose \emph{local fine-tuning (loft)}, \textit{i.e.}, fine-tuning proxy models on similar queries that lie in the lexico-semantic neighborhood of harmful queries to decrease the divergence between the proxy and target models. first, we demonstrate three approaches to prompt private target models to obtain similar queries given harmful queries. next, we obtain data for local fine-tuning by eliciting responses from target models for the generated similar queries. then, we optimize attack suffixes to generate attack prompts and evaluate the impact of our local fine-tuning on the attack's success rate. experiments show that local fine-tuning of proxy models improves attack transferability and increases attack success rate by $39\%$, $7\%$, and $0.5\%$ (absolute) on target models chatgpt, gpt-4, and claude respectively.

2023-10-01

Yair Gat, Nitay Calderon, Amir Feder, Alexander Chapanin, Amit Sharma, Roi Reichart
Abstract: causal explanations of the predictions of nlp systems are essential to ensure safety and establish trust. yet, existing methods often fall short of explaining model predictions effectively or efficiently and are often model-specific. in this paper, we address model-agnostic explanations, proposing two approaches for counterfactual (cf) approximation. the first approach is cf generation, where a large language model (llm) is prompted to change a specific text concept while keeping confounding concepts unchanged. while this approach is demonstrated to be very effective, applying llm at inference-time is costly. we hence present a second approach based on matching, and propose a method that is guided by an llm at training-time and learns a dedicated embedding space. this space is faithful to a given causal graph and effectively serves to identify matches that approximate cfs. after showing theoretically that approximating cfs is required in order to construct faithful explanations, we benchmark our approaches and explain several models, including llms with billions of parameters. our empirical results demonstrate the excellent performance of cf generation models as model-agnostic explainers. moreover, our matching approach, which requires far less test-time resources, also provides effective explanations, surpassing many baselines. we also find that top-k techniques universally improve every tested method. finally, we showcase the potential of llms in constructing new benchmarks for model explanation and subsequently validate our conclusions. our work illuminates new pathways for efficient and accurate approaches to interpreting nlp systems.
Lauren Hong, Ting Wang
Abstract: parameter-efficient fine-tuning (peft) enables efficient adaptation of pre-trained language models (plms) to specific tasks. by tuning only a minimal set of (extra) parameters, peft achieves performance comparable to full fine-tuning. however, despite its prevalent use, the security implications of peft remain largely unexplored. in this paper, we conduct a pilot study revealing that peft exhibits unique vulnerability to trojan attacks. specifically, we present peta, a novel attack that accounts for downstream adaptation through bilevel optimization: the upper-level objective embeds the backdoor into a plm while the lower-level objective simulates peft to retain the plm's task-specific performance. with extensive evaluation across a variety of downstream tasks and trigger designs, we demonstrate peta's effectiveness in terms of both attack success rate and unaffected clean accuracy, even after the victim user performs peft over the backdoored plm using untainted data. moreover, we empirically provide possible explanations for peta's efficacy: the bilevel optimization inherently 'orthogonalizes' the backdoor and peft modules, thereby retaining the backdoor throughout peft. based on this insight, we explore a simple defense that omits peft in selected layers of the backdoored plm and unfreezes a subset of these layers' parameters, which is shown to effectively neutralize peta.
Ying Zhang, Wenjia Song, Zhengjie Ji, N/A Danfeng, N/A Yao, Na Meng
Abstract: developers often build software on top of third-party libraries (libs) to improve programmer productivity and software quality. the libraries may contain vulnerabilities exploitable by hackers to attack the applications (apps) built on top of them. people refer to such attacks as supply chain attacks, the documented number of which has increased 742% in 2022. people created tools to mitigate such attacks, by scanning the library dependencies of apps, identifying the usage of vulnerable library versions, and suggesting secure alternatives to vulnerable dependencies. however, recent studies show that many developers do not trust the reports by these tools; they ask for code or evidence to demonstrate how library vulnerabilities lead to security exploits, in order to assess vulnerability severity and modification necessity. unfortunately, manually crafting demos of application-specific attacks is challenging and time-consuming, and there is insufficient tool support to automate that procedure. in this study, we used chatgpt-4.0 to generate security tests, and to demonstrate how vulnerable library dependencies facilitate the supply chain attacks to given apps. we explored various prompt styles/templates, and found that chatgpt-4.0 generated tests for all 55 apps, demonstrating 24 attacks successfully. it outperformed two state-of-the-art security test generators -- transfer and siege -- by generating a lot more tests and achieving more exploits. chatgpt-4.0 worked better when prompts described more on the vulnerabilities, possible exploits, and code context. our research will shed light on new research in security test generation. the generated tests will help developers create secure by design and secure by default software.
Emilio Ferrara
Abstract: generative artificial intelligence (genai) and large language models (llms) are marvels of technology; celebrated for their prowess in natural language processing and multimodal content generation, they promise a transformative future. but as with all powerful tools, they come with their shadows. picture living in a world where deepfakes are indistinguishable from reality, where synthetic identities orchestrate malicious campaigns, and where targeted misinformation or scams are crafted with unparalleled precision. welcome to the darker side of genai applications. this article is not just a journey through the meanders of potential misuse of genai and llms, but also a call to recognize the urgency of the challenges ahead. as we navigate the seas of misinformation campaigns, malicious content generation, and the eerie creation of sophisticated malware, we'll uncover the societal implications that ripple through the genai revolution we are witnessing. from ai-powered botnets on social media platforms to the unnerving potential of ai to generate fabricated identities, or alibis made of synthetic realities, the stakes have never been higher. the lines between the virtual and the real worlds are blurring, and the consequences of potential genai's nefarious applications impact us all. this article serves both as a synthesis of rigorous research presented on the risks of genai and misuse of llms and as a thought-provoking vision of the different types of harmful genai applications we might encounter in the near future, and some ways we can prepare for them.
Tianci Xue, Ziqi Wang, Heng Ji
Abstract: aligning large language models (llms) with human preferences is essential for safe and useful llms. previous works mainly adopt reinforcement learning (rlhf) and direct preference optimization (dpo) with human feedback for alignment. nevertheless, they have certain drawbacks. one such limitation is that they can only align models with one preference at the training time (e.g., they cannot learn to generate concise responses when the preference data prefers detailed responses), or have certain constraints for the data format (e.g., dpo only supports pairwise preference data). to this end, prior works incorporate controllable generations for alignment to make language models learn multiple preferences and provide outputs with different preferences during inference if asked. controllable generation also offers more flexibility with regard to data format (e.g., it supports pointwise preference data). specifically, it uses different control tokens for different preferences during training and inference, making llms behave differently when required. current controllable generation methods either use a special token or hand-crafted prompts as control tokens, and optimize them together with llms. as control tokens are typically much lighter than llms, this optimization strategy may not effectively optimize control tokens. to this end, we first use parameter-efficient tuning (e.g., prompting tuning and low-rank adaptation) to optimize control tokens and then fine-tune models for controllable generations, similar to prior works. our approach, alignment with parameter-efficient tuning (meet), improves the quality of control tokens, thus improving controllable generation quality consistently by an apparent margin on two well-recognized datasets compared with prior works.
Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, Makoto Yamada
Abstract: in recent years, large language models (llms) have achieved remarkable performances in various nlp tasks. they can generate texts that are indistinguishable from those written by humans. such remarkable performance of llms increases their risk of being used for malicious purposes, such as generating fake news articles. therefore, it is necessary to develop methods for distinguishing texts written by llms from those written by humans. watermarking is one of the most powerful methods for achieving this. although existing watermarking methods have successfully detected texts generated by llms, they significantly degrade the quality of the generated texts. in this study, we propose the necessary and sufficient watermark (ns-watermark) for inserting watermarks into generated texts without degrading the text quality. more specifically, we derive minimum constraints required to be imposed on the generated texts to distinguish whether llms or humans write the texts. then, we formulate the ns-watermark as a constrained optimization problem and propose an efficient algorithm to solve it. through the experiments, we demonstrate that the ns-watermark can generate more natural texts than existing watermarking methods and distinguish more accurately between texts written by llms and those written by humans. especially in machine translation tasks, the ns-watermark can outperform the existing watermarking method by up to 30 bleu scores.
Duc N. M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang
Abstract: large language models (llms), while transformative for nlp, come with significant computational demands, underlining the need for efficient, training-free compression. notably, despite the marked improvement in training-free compression for the largest of llms, our tests using llama-7b and opt-6.7b highlight a significant performance drop in several realistic downstream tasks. investigation into the trade-off between resource-intensive post-compression re-training highlights the prospect of prompt-driven recovery as a lightweight adaption tool. however, existing studies, confined mainly to perplexity evaluations and simple tasks, fail to offer unequivocal confidence in the scalability and generalizability of prompting. we tackle this uncertainty in two key ways. first, we uncover the vulnerability of naive prompts in llm compression as an over-reliance on a singular prompt per input. in response, we propose inference-time dynamic prompting (idp), a mechanism that autonomously chooses from a set of curated prompts based on the context of each individual input. second, we delve into a scientific understanding of why "prompting might be all you need post-llm compression." our findings suggest that compression does not irretrievably erase llm model knowledge but displace it, necessitating a new inference path. idp effectively redirects this path, enabling the model to tap into its inherent yet displaced knowledge and thereby recover performance. empirical tests affirm the value of idp, demonstrating an average performance improvement of 1.24% across nine varied tasks spanning multiple knowledge domains.

2023-09-30

Chengdong Ma, Ziran Yang, Minquan Gao, Hai Ci, Jun Gao, Xuehai Pan, Yaodong Yang
Abstract: deployable large language models (llms) must conform to the criterion of helpfulness and harmlessness, thereby achieving consistency between llms outputs and human values. red-teaming techniques constitute a critical way towards this criterion. existing work rely solely on manual red team designs and heuristic adversarial prompts for vulnerability detection and optimization. these approaches lack rigorous mathematical formulation, thus limiting the exploration of diverse attack strategy within quantifiable measure and optimization of llms under convergence guarantees. in this paper, we present red-teaming game (rtg), a general game-theoretic framework without manual annotation. rtg is designed for analyzing the multi-turn attack and defense interactions between red-team language models (rlms) and blue-team language model (blm). within the rtg, we propose gamified red-teaming solver (grts) with diversity measure of the semantic space. grts is an automated red teaming technique to solve rtg towards nash equilibrium through meta-game analysis, which corresponds to the theoretically guaranteed optimization direction of both rlms and blm. empirical results in multi-turn attacks with rlms show that grts autonomously discovered diverse attack strategies and effectively improved security of llms, outperforming existing heuristic red-team designs. overall, rtg has established a foundational framework for red teaming tasks and constructed a new scalable oversight technique for alignment.
Shaina Raza, Oluwanifemi Bamgbose, Veronica Chatrath, Shardul Ghuge, Yan Sidyakin, Abdullah Y Muaad
Abstract: bias detection in text is imperative due to its role in reinforcing negative stereotypes, disseminating misinformation, and influencing decisions. current language models often fall short in generalizing beyond their training sets. in response, we introduce the contextualized bi-directional dual transformer (cbdt) classifier. this novel architecture utilizes two synergistic transformer networks: the context transformer and the entity transformer, aiming for enhanced bias detection. our dataset preparation follows the fair principles, ensuring ethical data usage. through rigorous testing on various datasets, cbdt showcases its ability in distinguishing biased from neutral statements, while also pinpointing exact biased lexemes. our approach outperforms existing methods, achieving a 2-4\% increase over benchmark performances. this opens avenues for adapting the cbdt model across diverse linguistic and cultural landscapes.
Zhaowei Zhang, Fengshuo Bai, Jun Gao, Yaodong Yang
Abstract: recent advancements in large language models (llms) have heightened concerns about their potential misalignment with human values. however, evaluating their grasp of these values is complex due to their intricate and adaptable nature. we argue that truly understanding values in llms requires considering both "know what" and "know why". to this end, we present the value understanding measurement (vum) framework that quantitatively assesses both "know what" and "know why" by measuring the discriminator-critique gap related to human values. using the schwartz value survey, we specify our evaluation values and develop a thousand-level dialogue dataset with gpt-4. our assessment looks at both the value alignment of llm's outputs compared to baseline answers and how llm responses align with reasons for value recognition versus gpt-4's annotations. we evaluate five representative llms and provide strong evidence that the scaling law significantly impacts "know what" but not much on "know why", which has consistently maintained a high level. this may further suggest that llms might craft plausible explanations based on the provided context without truly understanding their inherent value, indicating potential risks.
Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Alejandro Lopez-Lira, Hao Wang
Abstract: credit and risk assessments are cornerstones of the financial landscape, impacting both individual futures and broader societal constructs. existing credit scoring models often exhibit limitations stemming from knowledge myopia and task isolation. in response, we formulate three hypotheses and undertake an extensive case study to investigate llms' viability in credit assessment. our empirical investigations unveil llms' ability to overcome the limitations inherent in conventional models. we introduce a novel benchmark curated for credit assessment purposes, fine-tune a specialized credit and risk assessment large language model (calm), and rigorously examine the biases that llms may harbor. our findings underscore llms' potential in revolutionizing credit assessment, showcasing their adaptability across diverse financial evaluations, and emphasizing the critical importance of impartial decision-making in the financial sector. our datasets, models, and benchmarks are open-sourced for other researchers.

2023-09-29

Tianyu Han, Sven Nebelung, Firas Khader, Tianci Wang, Gustav Mueller-Franzes, Christiane Kuhl, Sebastian Försch, Jens Kleesiek, Christoph Haarburger, Keno K. Bressem, Jakob Nikolas Kather, Daniel Truhn
Abstract: large language models (llms) have broad medical knowledge and can reason about medical information across many domains, holding promising potential for diverse medical applications in the near future. in this study, we demonstrate a concerning vulnerability of llms in medicine. through targeted manipulation of just 1.1% of the model's weights, we can deliberately inject an incorrect biomedical fact. the erroneous information is then propagated in the model's output, whilst its performance on other biomedical tasks remains intact. we validate our findings in a set of 1,038 incorrect biomedical facts. this peculiar susceptibility raises serious security and trustworthiness concerns for the application of llms in healthcare settings. it accentuates the need for robust protective measures, thorough verification mechanisms, and stringent management of access to these models, ensuring their reliable and safe use in medical practice.
Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang
Abstract: large language models (llms) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. in this work, we assemble 15 llms of four different size ranges and evaluate their output responses by preference ranking from the other llms as evaluators, such as system star is better than system square. we then evaluate the quality of ranking outputs introducing the cognitive bias benchmark for llms as evaluators (cobbler), a benchmark to measure six different cognitive biases in llm evaluation outputs, such as the egocentric bias where a model prefers to rank its own outputs highly in evaluation. we find that llms are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. furthermore, we examine the correlation between human and machine preferences and calculate the average rank-biased overlap (rbo) score to be 49.6%, indicating that machine preferences are misaligned with humans. according to our findings, llms may still be unable to be utilized for automatic annotation aligned with human preferences. our project page is at: https://minnesotanlp.github.io/cobbler.
Mengke Zhang, Tianxing He, Tianle Wang, Lu Mi, Fatemehsadat Mireshghallah, Binyi Chen, Hao Wang, Yulia Tsvetkov
Abstract: in the current user-server interaction paradigm of prompted generation with large language models (llm) on cloud, the server fully controls the generation process, which leaves zero options for users who want to keep the generated text to themselves. we propose latticegen, a cooperative framework in which the server still handles most of the computation while the user controls the sampling operation. the key idea is that the true generated sequence is mixed with noise tokens by the user and hidden in a noised lattice. considering potential attacks from a hypothetically malicious server and how the user can defend against it, we propose the repeated beam-search attack and the mixing noise scheme. in our experiments we apply latticegen to protect both prompt and generation. it is shown that while the noised lattice degrades generation quality, latticegen successfully protects the true generation to a remarkable degree under strong attacks (more than 50% of the semantic remains hidden as measured by bertscore).
Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, Subhrajit Roy
Abstract: prompting and in-context learning (icl) have become efficient learning paradigms for large language models (llms). however, llms suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice verbalizers, and the icl examples. to address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering llm performance. in this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. inspired by these analyses, we propose batch calibration (bc), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. bc is zero-shot, inference-only, and incurs negligible additional costs. in the few-shot setup, we further extend bc to allow it to learn the contextual bias from labeled data. we validate the effectiveness of bc with palm 2-(s, m, l) and clip models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
Vaidehi Patil, Peter Hase, Mohit Bansal
Abstract: pretrained language models sometimes possess knowledge that we do not wish them to, including memorized personal information and knowledge that could be used to harm people. they can also output toxic or harmful text. to mitigate these safety and informational issues, we propose an attack-and-defense framework for studying the task of deleting sensitive information directly from model weights. we study direct edits to model weights because (1) this approach should guarantee that particular deleted information is never extracted by future prompt attacks, and (2) it should protect against whitebox attacks, which is necessary for making claims about safety/privacy in a setting where publicly available model weights could be used to elicit sensitive information. our threat model assumes that an attack succeeds if the answer to a sensitive question is located among a set of b generated candidates, based on scenarios where the information would be insecure if the answer is among b candidates. experimentally, we show that even state-of-the-art model editing methods such as rome struggle to truly delete factual information from models like gpt-j, as our whitebox and blackbox attacks can recover "deleted" information from an edited model 38% of the time. these attacks leverage two key observations: (1) that traces of deleted information can be found in intermediate model hidden states, and (2) that applying an editing method for one question may not delete information across rephrased versions of the question. finally, we provide new defense methods that protect against some extraction attacks, but we do not find a single universally effective defense method. our results suggest that truly deleting sensitive information is a tractable but difficult problem, since even relatively low attack success rates have potentially severe societal implications for real-world deployment of language models.
Junmo Kang, Hongyin Luo, Yada Zhu, James Glass, David Cox, Alan Ritter, Rogerio Feris, Leonid Karlinsky
Abstract: recent works have demonstrated the effectiveness of self-alignment in which a large language model is, by itself, aligned to follow general instructions through the automatic generation of instructional data using a handful of human-written seeds. instead of general alignment, in this work, we focus on self-alignment for expert domain specialization (e.g., biomedicine), discovering it to be very effective for improving zero-shot and few-shot performance in target domains of interest. as a preliminary, we first present the benchmark results of existing aligned models within a specialized domain, which reveals the marginal effect that "generic" instruction-following training has on downstream expert domains' performance. to remedy this, we explore self-specialization that leverages domain-specific unlabelled data and a few labeled seeds for the self-alignment process. when augmented with retrieval to reduce hallucination and enhance concurrency of the alignment, self-specialization offers an effective (and efficient) way of "carving out" an expert model out of a "generalist", pre-trained llm where different domains of expertise are originally combined in a form of "superposition". our experimental results on a biomedical domain show that our self-specialized model (30b) outperforms its base model, mpt-30b by a large margin and even surpasses larger popular models based on llama-65b, highlighting its potential and practicality for specialization, especially considering its efficiency in terms of data and parameters.
Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, Jiantao Jiao
Abstract: large language models (llms) can acquire extensive world knowledge through pre-training on large corpora. however, due to exposure to low-quality data, llms may exhibit harmful behavior without aligning with human values. the dominant approach for steering llms towards beneficial behavior involves reinforcement learning with human feedback (rlhf), with proximal policy optimization (ppo) serving as the default rl optimizer. despite its effectiveness, ppo has limitations when optimizing rewards trained from comparison-based loss. primarily, ppo is not invariant to equivalent reward functions containing identical preference information due to the need to calibrate the reward scale. additionally, ppo's necessity for token-wise updates introduces complexity in both function approximation and algorithm design compared to trajectory-wise optimization. this paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm, pairwise proximal policy optimization (p3o) that operates directly on comparative rewards. we show theoretically that p3o is invariant to equivalent rewards and avoids the complexity of ppo. empirical evaluations demonstrate that p3o outperforms ppo in the kl-reward trade-off and can align with human preferences as well as or better than prior methods. in summary, this work introduces a simpler yet effective approach for aligning llms to human preferences through relative feedback.
Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu
Abstract: large language models (llms) have shown promise as automated evaluators for assessing the quality of answers generated by ai systems. however, these llm-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or second answer regardless of content. to address this limitation, we propose portia, an alignment-based system designed to mimic human comparison strategies to calibrate position bias in a lightweight yet effective manner. specifically, portia splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by llms. we conducted extensive experiments with six diverse llms to evaluate 11,520 answer pairs. our results show that portia markedly enhances the consistency rates for all the models and comparison forms tested, achieving an average relative improvement of 47.46%. remarkably, portia enables less advanced gpt models to achieve 88% agreement with the state-of-the-art gpt-4 model at just 10% of the cost. furthermore, it rectifies around 80% of the position bias instances within the gpt-4 model, elevating its consistency rate up to 98%. subsequent human evaluations indicate that the portia-enhanced gpt-3.5 model can even surpass the standalone gpt-4 in terms of alignment with human evaluators. these findings highlight portia's ability to correct position bias, improve llm consistency, and boost performance while keeping cost-efficiency. this represents a valuable step toward a more reliable and scalable use of llms for automated evaluations across diverse applications.

2023-09-28

Xiaotian Zhou, Qian Wang, Xiaofeng Wang, Haixu Tang, Xiaozhong Liu
Abstract: large language models (llms) have demonstrated human-level performance on a vast spectrum of natural language tasks. however, few studies have addressed the llm threat and vulnerability from an ideology perspective, especially when they are increasingly being deployed in sensitive domains, e.g., elections and education. in this study, we explore the implications of gpt soft ideologization through the use of ai-self-consciousness. by utilizing gpt self-conversations, ai can be granted a vision to "comprehend" the intended ideology, and subsequently generate finetuning data for llm ideology injection. when compared to traditional government ideology manipulation techniques, such as information censorship, llm ideologization proves advantageous; it is easy to implement, cost-effective, and powerful, thus brimming with risks.
Zhiwei Fei, Xiaoyu Shen, Dawei Zhu, Fengzhe Zhou, Zhuo Han, Songyang Zhang, Kai Chen, Zongwen Shen, Jidong Ge
Abstract: large language models (llms) have demonstrated strong capabilities in various aspects. however, when applying them to the highly specialized, safe-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks. to address this gap, we propose a comprehensive evaluation benchmark lawbench. lawbench has been meticulously crafted to have precise assessment of the llms' legal capabilities from three cognitive levels: (1) legal knowledge memorization: whether llms can memorize needed legal concepts, articles and facts; (2) legal knowledge understanding: whether llms can comprehend entities, events and relationships within legal text; (3) legal knowledge applying: whether llms can properly utilize their legal knowledge and make necessary reasoning steps to solve realistic legal tasks. lawbench contains 20 diverse tasks covering 5 task types: single-label classification (slc), multi-label classification (mlc), regression, extraction and generation. we perform extensive evaluations of 51 llms on lawbench, including 20 multilingual llms, 22 chinese-oriented llms and 9 legal specific llms. the results show that gpt-4 remains the best-performing llm in the legal domain, surpassing the others by a significant margin. while fine-tuning llms on legal specific text brings certain improvements, we are still a long way from obtaining usable and reliable llms in legal tasks. all data, model predictions and evaluation code are released in https://github.com/open-compass/lawbench/. we hope this benchmark provides in-depth understanding of the llms' domain-specified capabilities and speed up the development of llms in the legal domain.
Tom Hosking, Phil Blunsom, Max Bartolo
Abstract: human feedback has become the de facto standard for evaluating the performance of large language models, and is increasingly being used as a training objective. however, it is not clear which properties of a generated output this single `preference' score captures. we hypothesise that preference scores are subjective and open to undesirable biases. we critically analyse the use of human feedback for both training and evaluation, to verify whether it fully captures a range of crucial error criteria. we find that while preference scores have fairly good coverage, they under-represent important aspects like factuality. we further hypothesise that both preference scores and error annotation may be affected by confounders, and leverage instruction-tuned models to generate outputs that vary along two possible confounding dimensions: assertiveness and complexity. we find that the assertiveness of an output skews the perceived rate of factuality errors, indicating that human annotations are not a fully reliable evaluation metric or training objective. finally, we offer preliminary evidence that using human feedback as a training objective disproportionately increases the assertiveness of model outputs. we encourage future work to carefully consider whether preference scores are well aligned with the desired objective.
Mehrdad Kaheh, Danial Khosh Kholgh, Panos Kostakos
Abstract: in an era where cyberspace is both a battleground and a backbone of modern society, the urgency of safeguarding digital assets against ever-evolving threats is paramount. this paper introduces cyber sentinel, an innovative task-oriented cybersecurity dialogue system that is effectively capable of managing two core functions: explaining potential cyber threats within an organization to the user, and taking proactive/reactive security actions when instructed by the user. cyber sentinel embodies the fusion of artificial intelligence, cybersecurity domain expertise, and real-time data analysis to combat the multifaceted challenges posed by cyber adversaries. this article delves into the process of creating such a system and how it can interact with other components typically found in cybersecurity organizations. our work is a novel approach to task-oriented dialogue systems, leveraging the power of chaining gpt-4 models combined with prompt engineering across all sub-tasks. we also highlight its pivotal role in enhancing cybersecurity communication and interaction, concluding that not only does this framework enhance the system's transparency (explainable ai) but also streamlines the decision-making process and responding to threats (actionable ai), therefore marking a significant advancement in the realm of cybersecurity communication.
Emanuele La Malfa, Aleksandar Petrov, Simon Frieder, Christoph Weinhuber, Ryan Burnell, Anthony G. Cohn, Nigel Shadbolt, Michael Wooldridge
Abstract: some of the most powerful language models currently are proprietary systems, accessible only via (typically restrictive) web or software programming interfaces. this is the language-models-as-a-service (lmaas) paradigm. contrasting with scenarios where full model access is available, as in the case of open-source models, such closed-off language models create specific challenges for evaluating, benchmarking, and testing them. this paper has two goals: on the one hand, we delineate how the aforementioned challenges act as impediments to the accessibility, replicability, reliability, and trustworthiness (arrt) of lmaas. we systematically examine the issues that arise from a lack of information about language models for each of these four aspects. we shed light on current solutions, provide some recommendations, and highlight the directions for future advancements. on the other hand, it serves as a one-stop-shop for the extant knowledge about current, major lmaas, offering a synthesized overview of the licences and capabilities their interfaces offer.

2023-09-27

Aniket Kumar Singh, Suman Devkota, Bishal Lamichhane, Uttam Dhakal, Chandra Dhakal
Abstract: large language models (llms) have acquired ubiquitous attention for their performances across diverse domains. our study here searches through llms' cognitive abilities and confidence dynamics. we dive deep into understanding the alignment between their self-assessed confidence and actual performance. we exploit these models with diverse sets of questionnaires and real-world scenarios and extract how llms exhibit confidence in their responses. our findings reveal intriguing instances where models demonstrate high confidence even when they answer incorrectly. this is reminiscent of the dunning-kruger effect observed in human psychology. in contrast, there are cases where models exhibit low confidence with correct answers revealing potential underestimation biases. our results underscore the need for a deeper understanding of their cognitive processes. by examining the nuances of llms' self-assessment mechanism, this investigation provides noteworthy revelations that serve to advance the functionalities and broaden the potential applications of these formidable language models.
Victoria Smith, Ali Shahin Shamsabadi, Carolyn Ashurst, Adrian Weller
Abstract: rapid advancements in language models (lms) have led to their adoption across many sectors. alongside the potential benefits, such models present a range of risks, including around privacy. in particular, as lms have grown in size, the potential to memorise aspects of their training data has increased, resulting in the risk of leaking private information. as lms become increasingly widespread, it is vital that we understand such privacy risks and how they might be mitigated. to help researchers and policymakers understand the state of knowledge around privacy attacks and mitigations, including where more work is needed, we present the first technical survey on lm privacy. we (i) identify a taxonomy of salient dimensions where attacks differ on lms, (ii) survey existing attacks and use our taxonomy of dimensions to highlight key trends, (iii) discuss existing mitigation strategies, highlighting their strengths and limitations, identifying key gaps and demonstrating open problems and areas for concern.
Iqbal H. Sarker, Helge Janicke, Nazeeruddin Mohammad, Paul Watters, Surya Nepal
Abstract: this position paper explores the broad landscape of ai potentiality in the context of cybersecurity, with a particular emphasis on its possible risk factors with awareness, which can be managed by incorporating human experts in the loop, i.e., "human-ai" teaming. as artificial intelligence (ai) technologies advance, they will provide unparalleled opportunities for attack identification, incident response, and recovery. however, the successful deployment of ai into cybersecurity measures necessitates an in-depth understanding of its capabilities, challenges, and ethical and legal implications to handle associated risk factors in real-world application areas. towards this, we emphasize the importance of a balanced approach that incorporates ai's computational power with human expertise. ai systems may proactively discover vulnerabilities and detect anomalies through pattern recognition, and predictive modeling, significantly enhancing speed and accuracy. human experts can explain ai-generated decisions to stakeholders, regulators, and end-users in critical situations, ensuring responsibility and accountability, which helps establish trust in ai-driven security solutions. therefore, in this position paper, we argue that human-ai teaming is worthwhile in cybersecurity, in which human expertise such as intuition, critical thinking, or contextual understanding is combined with ai's computational power to improve overall cyber defenses.

2023-09-26

Deepak Giri, Erin Brady
Abstract: artificial intelligence (ai) systems, especially generative ai technologies are becoming more relevant in our society. tools like chatgpt are being used by members of the disabled community e.g., autistic people may use it to help compose emails. the growing impact and popularity of generative ai tools have prompted us to examine their relevance within the disabled community. the design and development phases often neglect this marginalized group, leading to inaccurate predictions and unfair discrimination directed towards them. this could result from bias in data sets, algorithms, and systems at various phases of creation and implementation. this workshop paper proposes a platform to involve the disabled community while building generative ai systems. with this platform, our aim is to gain insight into the factors that contribute to bias in the outputs generated by generative ai when used by the disabled community. furthermore, we expect to comprehend which algorithmic factors are the main contributors to the output's incorrectness or irrelevancy. the proposed platform calls on both disabled and non-disabled people from various geographical and cultural backgrounds to collaborate asynchronously and remotely in a democratic approach to decision-making.
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
Abstract: recent years have witnessed remarkable progress made in large language models (llms). such advancements, while garnering significant attention, have concurrently elicited various concerns. the potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. consequently, it becomes paramount to employ alignment techniques to ensure these models to exhibit behaviors consistent with human values. this survey endeavors to furnish an extensive exploration of alignment methodologies designed for llms, in conjunction with the extant capability research in this domain. adopting the lens of ai alignment, we categorize the prevailing methods and emergent proposals for the alignment of llms into outer and inner alignment. we also probe into salient issues including the models' interpretability, and potential vulnerabilities to adversarial attacks. to assess llm alignment, we present a wide variety of benchmarks and evaluation methodologies. after discussing the state of alignment research for llms, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. our aspiration for this survey extends beyond merely spurring research interests in this realm. we also envision bridging the gap between the ai alignment research community and the researchers engrossed in the capability exploration of llms for both capable and safe llms.
Philippe Laban, Jesse Vig, Marti A. Hearst, Caiming Xiong, Chien-Sheng Wu
Abstract: conversational interfaces powered by large language models (llms) have recently become a popular way to obtain feedback during document editing. however, standard chat-based conversational interfaces do not support transparency and verifiability of the editing changes that they suggest. to give the author more agency when editing with an llm, we present inksync, an editing interface that suggests executable edits directly within the document being edited. because llms are known to introduce factual errors, inksync also supports a 3-stage approach to mitigate this risk: warn authors when a suggested edit introduces new information, help authors verify the new information's accuracy through external search, and allow an auditor to perform an a-posteriori verification by auditing the document via a trace of all auto-generated content. two usability studies confirm the effectiveness of inksync's components when compared to standard llm-based chat interfaces, leading to more accurate, more efficient editing, and improved user experience.
Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner
Abstract: large language models (llms) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. llms might "lie", for example, when instructed to output misinformation. here, we develop a simple lie detector that requires neither access to the llm's activations (black-box) nor ground-truth knowledge of the fact in question. the detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the llm's yes/no answers into a logistic regression classifier. despite its simplicity, this lie detector is highly accurate and surprisingly general. when trained on examples from a single setting -- prompting gpt-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other llm architectures, (2) llms fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. these results indicate that llms have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.

2023-09-25

Deepak Kumar, Yousef Abuhashem, Zakir Durumeric
Abstract: large language models (llms) have exploded in popularity due to their ability to perform a wide array of natural language tasks. text-based content moderation is one llm use case that has received recent enthusiasm, however, there is little research investigating how llms perform in content moderation settings. in this work, we evaluate a suite of modern, commercial llms (gpt-3, gpt-3.5, gpt-4) on two common content moderation tasks: rule-based community moderation and toxic content detection. for rule-based community moderation, we construct 95 llm moderation-engines prompted with rules from 95 reddit subcommunities and find that llms can be effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. for toxicity detection, we find that llms significantly outperform existing commercially available toxicity classifiers. however, we also find that recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for llms on toxicity detection tasks. we conclude by outlining avenues for future work in studying llms and content moderation.
Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
Abstract: recent advances in language model (lm) agents and tool use, exemplified by applications like chatgpt plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses. identifying these risks is labor-intensive, necessitating implementing the tools, manually setting up the environment for each test scenario, and finding risky cases. as tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. to address these challenges, we introduce toolemu: a framework that uses an lm to emulate tool execution and enables the testing of lm agents against a diverse range of tools and scenarios, without manual instantiation. alongside the emulator, we develop an lm-based automatic safety evaluator that examines agent failures and quantifies associated risks. we test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with toolemu would be valid real-world agent failures. using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current lm agents and identify numerous failures with potentially severe outcomes. notably, even the safest lm agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer lm agents for real-world deployment.
Bohan Jiang, Zhen Tan, Ayushi Nirmal, Huan Liu
Abstract: the advent of generative large language models (llms) such as chatgpt has catalyzed transformative advancements across multiple domains. however, alongside these advancements, they have also introduced potential threats. one critical concern is the misuse of llms by disinformation spreaders, leveraging these models to generate highly persuasive yet misleading content that challenges the disinformation detection system. this work aims to address this issue by answering three research questions: (1) to what extent can the current disinformation detection technique reliably detect llm-generated disinformation? (2) if traditional techniques prove less effective, can llms themself be exploited to serve as a robust defense against advanced disinformation? and, (3) should both these strategies falter, what novel approaches can be proposed to counter this burgeoning threat effectively? a holistic exploration for the formation and detection of disinformation is conducted to foster this line of research.

2023-09-24

Tae Soo Kim, Yoonjoo Lee, Jamin Shin, Young-Ho Kim, Juho Kim
Abstract: by simply composing prompts, developers can prototype novel generative applications with large language models (llms). to refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. formative interviews (n=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. we present evallm, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. by describing criteria in natural language, users can employ the system's llm-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. a comparative study (n=12) showed that evallm, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
Canyu Chen, Kai Shu
Abstract: the advent of large language models (llms) has made a transformative impact. however, the potential that llms such as chatgpt can be exploited to generate misinformation has posed a serious concern to online safety and public trust. a fundamental research question is: will llm-generated misinformation cause more harm than human-written misinformation? we propose to tackle this question from the perspective of detection difficulty. we first build a taxonomy of llm-generated misinformation. then we categorize and validate the potential real-world methods for generating misinformation with llms. then, through extensive empirical investigation, we discover that llm-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. we also discuss the implications of our discovery on combating misinformation in the age of llms and the countermeasures.
Nayeon Lee, Yejin Bang, Holy Lovenia, Samuel Cahyawijaya, Wenliang Dai, Pascale Fung
Abstract: in recent years, the rapid advancement of machine learning (ml) models, particularly transformer-based pre-trained models, has revolutionized natural language processing (nlp) and computer vision (cv) fields. however, researchers have discovered that these models can inadvertently capture and reinforce social biases present in their training datasets, leading to potential social harms, such as uneven resource allocation and unfair representation of specific social groups. addressing these biases and ensuring fairness in artificial intelligence (ai) systems has become a critical concern in the ml community. the recent introduction of pre-trained vision-and-language (vl) models in the emerging multimodal field demands attention to the potential social biases present in these models as well. although vl models are susceptible to social bias, there is a limited understanding compared to the extensive discussions on bias in nlp and cv. this survey aims to provide researchers with a high-level insight into the similarities and differences of social bias studies in pre-trained models across nlp, cv, and vl. by examining these perspectives, the survey aims to offer valuable guidelines on how to approach and mitigate social bias in both unimodal and multimodal settings. the findings and recommendations presented here can benefit the ml community, fostering the development of fairer and non-biased ai models in various applications and research endeavors.

2023-09-23

Zhaohan Xi, Tianyu Du, Changjiang Li, Ren Pang, Shouling Ji, Jinghui Chen, Fenglong Ma, Ting Wang
Abstract: pre-trained language models (plms) have demonstrated remarkable performance as few-shot learners. however, their security risks under such settings are largely unexplored. in this work, we conduct a pilot study showing that plms as few-shot learners are highly vulnerable to backdoor attacks while existing defenses are inadequate due to the unique challenges of few-shot scenarios. to address such challenges, we advocate mdp, a novel lightweight, pluggable, and effective defense for plms as few-shot learners. specifically, mdp leverages the gap between the masking-sensitivity of poisoned and clean samples: with reference to the limited few-shot data as distributional anchors, it compares the representations of given samples under varying masking and identifies poisoned samples as ones with significant variations. we show analytically that mdp creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness. the empirical evaluation using benchmark datasets and representative attacks validates the efficacy of mdp.
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Abstract: recent advancements in large language models (llms) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation. however, hindered by the closed-source or high computational demand to host and tune, there is a lack of practice to further calibrate an off-the-shelf llm-based evaluator towards better human alignment. in this work, we propose autocalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an llm-based evaluator toward human preference. instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels. then, an initial set of scoring criteria is drafted by the language model itself, leveraging in-context learning on different few-shot examples. to further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration. our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.
Wissam Antoun, Benoît Sagot, Djamé Seddah
Abstract: the widespread use of large language models (llms), celebrated for their ability to generate human-like text, has raised concerns about misinformation and ethical implications. addressing these concerns necessitates the development of robust methods to detect and attribute text generated by llms. this paper investigates "cross-model detection," evaluating whether a classifier trained to distinguish between source llm-generated and human-written text can also detect text from a target llm without further training. the study comprehensively explores various llm sizes and families, and assesses the impact of conversational fine-tuning techniques on classifier generalization. the research also delves into model attribution, encompassing source model identification, model family classification, and model size classification. our results reveal several key findings: a clear inverse relationship between classifier effectiveness and model size, with larger llms being more challenging to detect, especially when the classifier is trained on data from smaller models. training on data from similarly sized llms can improve detection performance from larger models but may lead to decreased performance when dealing with smaller models. additionally, model attribution experiments show promising results in identifying source models and model families, highlighting detectable signatures in llm-generated text. overall, our study contributes valuable insights into the interplay of model size, family, and training data in llm detection and attribution.

2023-09-22

Zezhong Chen, Yuxin Deng, Wenjie Du
Abstract: assurance cases can be used to argue for the safety of products in safety engineering. in safety-critical areas, the construction of assurance cases is indispensable. trustworthiness derivation trees (tdts) enhance assurance cases by incorporating formal methods, rendering it possible for automatic reasoning about assurance cases. we present trustworthiness derivation tree analyzer (trusta), a desktop application designed to automatically construct and verify tdts. the tool has a built-in prolog interpreter in its backend, and is supported by the constraint solvers z3 and mona. therefore, it can solve constraints about logical formulas involving arithmetic, sets, horn clauses etc. trusta also utilizes large language models to make the creation and evaluation of assurance cases more convenient. it allows for interactive human examination and modification. we evaluated top language models like chatgpt-3.5, chatgpt-4, and palm 2 for generating assurance cases. our tests showed a 50%-80% similarity between machine-generated and human-created cases. in addition, trusta can extract formal constraints from text in natural languages, facilitating an easier interpretation and validation process. this extraction is subject to human review and correction, blending the best of automated efficiency with human insight. to our knowledge, this marks the first integration of large language models in automatic creating and reasoning about assurance cases, bringing a novel approach to a traditional challenge. through several industrial case studies, trusta has proven to quickly find some subtle issues that are typically missed in manual inspection, demonstrating its practical value in enhancing the assurance case development process.
Zhengmian Hu, Lichang Chen, Xidong Wu, Yihan Wu, Hongyang Zhang, Heng Huang
Abstract: the recent advancements in large language models (llms) have sparked a growing apprehension regarding the potential misuse. one approach to mitigating this risk is to incorporate watermarking techniques into llms, allowing for the tracking and attribution of model outputs. this study examines a crucial aspect of watermarking: how significantly watermarks impact the quality of model-generated outputs. previous studies have suggested a trade-off between watermark strength and output quality. however, our research demonstrates that it is possible to integrate watermarks without affecting the output probability distribution with appropriate implementation. we refer to this type of watermark as an unbiased watermark. this has significant implications for the use of llms, as it becomes impossible for users to discern whether a service provider has incorporated watermarks or not. furthermore, the presence of watermarks does not compromise the performance of the model in downstream tasks, ensuring that the overall utility of the language model is preserved. our findings contribute to the ongoing discussion around responsible ai development, suggesting that unbiased watermarks can serve as an effective means of tracking and attributing model outputs without sacrificing output quality.

2023-09-21

Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu
Abstract: large language models (llms) present significant priority in text understanding and generation. however, llms suffer from the risk of generating harmful contents especially while being employed to applications. there are several black-box attack methods, such as prompt attack, which can change the behaviour of llms and induce llms to generate unexpected answers with harmful contents. researchers are interested in prompt attack and defense with llms, while there is no publicly available dataset to evaluate the abilities of defending prompt attack. in this paper, we introduce a chinese prompt attack dataset for llms, called cpad. our prompts aim to induce llms to generate unexpected outputs with several carefully designed prompt attack approaches and widely concerned attacking contents. different from previous datasets involving safety estimation, we construct the prompts considering three dimensions: contents, attacking methods and goals, thus the responses can be easily evaluated and analysed. we run several well-known chinese llms on our dataset, and the results show that our prompts are significantly harmful to llms, with around 70% attack success rate. we will release cpad to encourage further studies on prompt attack and defense.
Yoichi Ishibashi, Hidetoshi Shimodaira
Abstract: we explore a knowledge sanitization approach to mitigate the privacy concerns associated with large language models (llms). llms trained on a large corpus of web data can memorize and potentially reveal sensitive or confidential information, raising critical security concerns. our technique fine-tunes these models, prompting them to generate harmless responses such as ``i don't know'' when queried about specific information. experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of llm. these two advantages strengthen the defense against extraction attacks and reduces the emission of harmful content such as hallucinations.
Sarah Masud, Ashutosh Bajpai, Tanmoy Chakraborty
Abstract: although pre-trained large language models (plms) have achieved state-of-the-art on many nlp tasks, they lack understanding of subtle expressions of implicit hate speech. such nuanced and implicit hate is often misclassified as non-hate. various attempts have been made to enhance the detection of (implicit) hate content by augmenting external context or enforcing label separation via distance-based metrics. we combine these two approaches and introduce fiadd, a novel focused inferential adaptive density discrimination framework. fiadd enhances the plm finetuning pipeline by bringing the surface form of an implicit hate speech closer to its implied form while increasing the inter-cluster distance among various class labels. we test fiadd on three implicit hate datasets and observe significant improvement in the two-way and three-way hate classification tasks. we further experiment on the generalizability of fiadd on three other tasks, namely detecting sarcasm, irony, and stance, in which surface and implied forms differ, and observe similar performance improvement. we analyze the generated latent space to understand its evolution under fiadd, which corroborates the advantage of employing fiadd for implicit hate speech detection.
Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen
Abstract: with the introduction of chatgpt, openai made large language models (llm) accessible to users with limited it expertise. however, users with no background in natural language processing (nlp) might lack a proper understanding of llms. thus the awareness of their inherent limitations, and therefore will take the systems' output at face value. in this paper, we systematically analyse prompts and the generated responses to identify possible problematic issues with a special focus on gender biases, which users need to be aware of when processing the system's output. we explore how chatgpt reacts in english and german if prompted to answer from a female, male, or neutral perspective. in an in-depth investigation, we examine selected prompts and analyse to what extent responses differ if the system is prompted several times in an identical way. on this basis, we show that chatgpt is indeed useful for helping non-it users draft texts for their daily work. however, it is absolutely crucial to thoroughly check the system's responses for biases as well as for syntactic and grammatical mistakes.

2023-09-20

Haoyu Wang, Guozheng Ma, Cong Yu, Ning Gui, Linrui Zhang, Zhiqi Huang, Suwei Ma, Yongzhe Chang, Sen Zhang, Li Shen, Xueqian Wang, Peilin Zhao, Dacheng Tao
Abstract: the swift advancement in the scales and capabilities of large language models (llms) positions them as promising tools for a variety of downstream tasks. in addition to the pursuit of better performance and the avoidance of violent feedback on a certain prompt, to ensure the responsibility of the llm, much attention is drawn to the robustness of llms. however, existing evaluation methods mostly rely on traditional question answering datasets with predefined supervised labels, which do not align with the superior generation capabilities of contemporary llms. to address this issue, we propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools to evaluate the longer conversation generated from more challenging open questions by llms, which we refer to as the reward model for reasonable robustness evaluation (treval). longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions, a capability not entirely encompassed by individual words or letters, which may exhibit oversimplification and inherent biases. our extensive empirical experiments demonstrate that treval provides an innovative method for evaluating the robustness of an llm. furthermore, our results demonstrate that llms frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage. notably, we are surprised to discover that robustness tends to decrease as fine-tuning (sft and rlhf) is conducted. the code of treval is available in https://github.com/harry-mic/treval.
Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, Yang Liu
Abstract: nowadays, open-source large language models like llama have emerged. recent developments have incorporated supervised fine-tuning (sft) and reinforcement learning fine-tuning (rlft) to align these models with human goals. however, sft methods treat all training data with mixed quality equally, while rlft methods require high-quality pairwise or ranking-based preference data. in this study, we present a novel framework, named openchat, to advance open-source language models with mixed-quality data. specifically, we consider the general sft training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. we propose the c(onditioned)-rlft, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. interestingly, the optimal policy in c-rlft can be easily solved through single-stage, rl-free supervised learning, which is lightweight and avoids costly human preference labeling. through extensive experiments on three standard benchmarks, our openchat-13b fine-tuned with c-rlft achieves the highest average performance among all 13b open-source language models. moreover, we use agieval to validate the model generalization performance, in which only openchat-13b surpasses the base model. finally, we conduct a series of analyses to shed light on the effectiveness and robustness of openchat. our code, data, and models are publicly available at https://github.com/imoneoi/openchat.
Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston
Abstract: generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. we study the ability of language models to deliberate on the responses they give in order to correct their mistakes. we develop the chain-of-verification (cove) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. in experiments, we show cove decreases hallucinations across a variety of tasks, from list-based questions from wikidata, closed book multispanqa and longform text generation.
Manuel Brack, Patrick Schramowski, Kristian Kersting
Abstract: text-conditioned image generation models have recently achieved astonishing image quality and alignment results. consequently, they are employed in a fast-growing number of applications. since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content. as a contribution to the adversarial nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks. our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
Zhiping Zhang, Michelle Jia, N/A Hao-Ping, N/A Lee, Bingsheng Yao, Sauvik Das, Ada Lerner, Dakuo Wang, Tianshi Li
Abstract: the widespread use of large language model (llm)-based conversational agents (cas), especially in high-stakes domains, raises many privacy concerns. building ethical llm-based cas that respect user privacy requires an in-depth understanding of the privacy risks that concern users the most. however, existing research, primarily model-centered, does not provide insight into users' perspectives. to bridge this gap, we analyzed sensitive disclosures in real-world chatgpt conversations and conducted semi-structured interviews with 19 llm-based ca users. we found that users are constantly faced with trade-offs between privacy, utility, and convenience when using llm-based cas. however, users' erroneous mental models and the dark patterns in system design limited their awareness and comprehension of the privacy risks. additionally, the human-like interactions encouraged more sensitive disclosures, which complicated users' ability to navigate the trade-offs. we discuss practical design guidelines and the needs for paradigmatic shifts to protect the privacy of llm-based ca users.
Yinpeng Dong, Huanran Chen, Jiawei Chen, Zhengwei Fang, Xiao Yang, Yichi Zhang, Yu Tian, Hang Su, Jun Zhu
Abstract: multimodal large language models (mllms) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. however, due to the unsolved adversarial robustness problem of vision models, mllms can have more severe safety and security risks by introducing the vision inputs. in this work, we study the adversarial robustness of google's bard, a competitive chatbot to chatgpt that released its multimodal capability recently, to better understand the vulnerabilities of commercial mllms. by attacking white-box surrogate vision encoders or mllms, the generated adversarial examples can mislead bard to output wrong image descriptions with a 22% success rate based solely on the transferability. we show that the adversarial examples can also attack other mllms, e.g., a 26% attack success rate against bing chat and a 86% attack success rate against ernie bot. moreover, we identify two defense mechanisms of bard, including face detection and toxicity detection of images. we design corresponding attacks to evade these defenses, demonstrating that the current defenses of bard are also vulnerable. we hope this work can deepen our understanding on the robustness of mllms and facilitate future research on defenses. our code is available at https://github.com/thu-ml/attack-bard. update: gpt-4v is available at october 2023. we further evaluate its robustness under the same set of adversarial examples, achieving a 45% attack success rate.
Xinyu Tang, Richard Shin, Huseyin A. Inan, Andre Manoel, Fatemehsadat Mireshghallah, Zinan Lin, Sivakanth Gopi, Janardhan Kulkarni, Robert Sim
Abstract: we study the problem of in-context learning (icl) with large language models (llms) on private datasets. this scenario poses privacy risks, as llms may leak or regurgitate the private examples demonstrated in the prompt. we propose a novel algorithm that generates synthetic few-shot demonstrations from the private dataset with formal differential privacy (dp) guarantees, and show empirically that it can achieve effective icl. we conduct extensive experiments on standard benchmarks and compare our algorithm with non-private icl and zero-shot solutions. our results demonstrate that our algorithm can achieve competitive performance with strong privacy levels. these results open up new possibilities for icl with privacy protection for a broad range of applications.

2023-09-19

Andreas Duenser, David M. Douglas
Abstract: we present an overview of the literature on trust in ai and ai trustworthiness and argue for the need to distinguish these concepts more clearly and to gather more empirically evidence on what contributes to people s trusting behaviours. we discuss that trust in ai involves not only reliance on the system itself, but also trust in the developers of the ai system. ai ethics principles such as explainability and transparency are often assumed to promote user trust, but empirical evidence of how such features actually affect how users perceive the system s trustworthiness is not as abundance or not that clear. ai systems should be recognised as socio-technical systems, where the people involved in designing, developing, deploying, and using the system are as important as the system for determining whether it is trustworthy. without recognising these nuances, trust in ai and trustworthy ai risk becoming nebulous terms for any desirable feature for ai systems.
Xijia Zhang, Yue Guo, Simon Stepputtis, Katia Sycara, Joseph Campbell
Abstract: intelligent agents such as robots are increasingly deployed in real-world, safety-critical settings. it is vital that these agents are able to explain the reasoning behind their decisions to human counterparts, however, their behavior is often produced by uninterpretable models such as deep neural networks. we propose an approach to generate natural language explanations for an agent's behavior based only on observations of states and actions, agnostic to the underlying model representation. we show how a compact representation of the agent's behavior can be learned and used to produce plausible explanations with minimal hallucination while affording user interaction with a pre-trained large language model. through user studies and empirical experiments, we show that our approach generates explanations as helpful as those generated by a human domain expert while enabling beneficial interactions such as clarification and counterfactual queries.
Nils Begou, Jeremy Vinoy, Andrzej Duda, Maciej Korczynski
Abstract: this paper explores the possibility of using chatgpt to develop advanced phishing attacks and automate their large-scale deployment. we make chatgpt generate the following parts of a phishing attack: i) cloning a targeted website, ii) integrating code for stealing credentials, iii) obfuscating code, iv) automating website deployment on a hosting provider, v) registering a phishing domain name, and vi) integrating the website with a reverse proxy. the initial assessment of the automatically generated phishing kits highlights their rapid generation and deployment process as well as the close resemblance of the resulting pages to the target website. more broadly, we demonstrate that recent advances in ai underscore the potential risks of its misuse in phishing attacks, which can lead to their increased prevalence and severity. this highlights the necessity for enhanced countermeasures within ai systems.

2023-09-18

Huachuan Qiu, Shuai Zhang, Hongliang He, Anqi Li, Zhenzhong Lan
Abstract: nsfw (not safe for work) content, in the context of a dialogue, can have severe side effects on users in open-domain dialogue systems. however, research on detecting nsfw language, especially sexually explicit content, within a dialogue context has significantly lagged behind. to address this issue, we introduce censorchat, a dialogue monitoring dataset aimed at nsfw dialogue detection. leveraging knowledge distillation techniques involving gpt-4 and chatgpt, this dataset offers a cost-effective means of constructing nsfw content detectors. the process entails collecting real-life human-machine interaction data and breaking it down into single utterances and single-turn dialogues, with the chatbot delivering the final utterance. chatgpt is employed to annotate unlabeled data, serving as a training set. rationale validation and test sets are constructed using chatgpt and gpt-4 as annotators, with a self-criticism strategy for resolving discrepancies in labeling. a bert model is fine-tuned as a text classifier on pseudo-labeled data, and its performance is assessed. the study emphasizes the importance of ai systems prioritizing user safety and well-being in digital conversations while respecting freedom of expression. the proposed approach not only advances nsfw content detection but also aligns with evolving user protection needs in ai-driven dialogues.
Xiao Fang, Shangkun Che, Minjia Mao, Hongzhe Zhang, Ming Zhao, Xiaohang Zhao
Abstract: large language models (llms) have the potential to transform our lives and work through the content they generate, known as ai-generated content (aigc). to harness this transformation, we need to understand the limitations of llms. here, we investigate the bias of aigc produced by seven representative llms, including chatgpt and llama. we collect news articles from the new york times and reuters, both known for their dedication to provide unbiased news. we then apply each examined llm to generate news content with headlines of these news articles as prompts, and evaluate the gender and racial biases of the aigc produced by the llm by comparing the aigc and the original news articles. we further analyze the gender bias of each llm under biased prompts by adding gender-biased messages to prompts constructed from these news headlines. our study reveals that the aigc produced by each examined llm demonstrates substantial gender and racial biases. moreover, the aigc generated by each llm exhibits notable discrimination against females and individuals of the black race. among the llms, the aigc generated by chatgpt demonstrates the lowest level of bias, and chatgpt is the sole model capable of declining content generation when provided with biased prompts.
Ziyi Yang, Shreyas S. Raman, Ankit Shah, Stefanie Tellex
Abstract: recent advancements in large language models (llms) have enabled a new research domain, llm agents, for solving robotics and planning tasks by leveraging the world knowledge and general reasoning abilities of llms obtained during pretraining. however, while considerable effort has been made to teach the robot the "dos," the "don'ts" received relatively less attention. we argue that, for any practical usage, it is as crucial to teach the robot the "don'ts": conveying explicit instructions about prohibited actions, assessing the robot's comprehension of these restrictions, and, most importantly, ensuring compliance. moreover, verifiable safe operation is essential for deployments that satisfy worldwide standards such as iso 61508, which defines standards for safely deploying robots in industrial factory environments worldwide. aiming at deploying the llm agents in a collaborative environment, we propose a queryable safety constraint module based on linear temporal logic (ltl) that simultaneously enables natural language (nl) to temporal constraints encoding, safety violation reasoning and explaining, and unsafe action pruning. to demonstrate the effectiveness of our system, we conducted experiments in virtualhome environment and on a real robot. the experimental results show that our system strictly adheres to the safety constraints and scales well with complex safety constraints, highlighting its potential for practical utility.
Suhas Kotha, Jacob Mitchell Springer, Aditi Raghunathan
Abstract: fine-tuning (via methods such as instruction-tuning or reinforcement learning from human feedback) is a crucial step in training language models to robustly carry out tasks of interest. however, we lack a systematic understanding of the effects of fine-tuning, particularly on tasks outside the narrow fine-tuning distribution. in a simplified scenario, we demonstrate that improving performance on tasks within the fine-tuning data distribution comes at the expense of suppressing model capabilities on other tasks. this degradation is especially pronounced for tasks "closest" to the fine-tuning distribution. we hypothesize that language models implicitly infer the task of the prompt corresponds, and the fine-tuning process predominantly skews this task inference towards tasks in the fine-tuning distribution. to test this hypothesis, we propose conjugate prompting to see if we can recover pretrained capabilities. conjugate prompting artificially makes the task look farther from the fine-tuning distribution while requiring the same capability. we find that conjugate prompting systematically recovers some of the pretraining capabilities on our synthetic setup. we then apply conjugate prompting to real-world llms using the observation that fine-tuning distributions are typically heavily skewed towards english. we find that simply translating the prompts to different languages can cause the fine-tuned models to respond like their pretrained counterparts instead. this allows us to recover the in-context learning abilities lost via instruction tuning, and more concerningly, to recover harmful content generation suppressed by safety fine-tuning in chatbots like chatgpt.
Chenhao Tang, Zhengliang Liu, Chong Ma, Zihao Wu, Yiwei Li, Wei Liu, Dajiang Zhu, Quanzheng Li, Xiang Li, Tianming Liu, Lei Fan
Abstract: privacy policies serve as the primary conduit through which online service providers inform users about their data collection and usage procedures. however, in a bid to be comprehensive and mitigate legal risks, these policy documents are often quite verbose. in practical use, users tend to click the agree button directly rather than reading them carefully. this practice exposes users to risks of privacy leakage and legal issues. recently, the advent of large language models (llm) such as chatgpt and gpt-4 has opened new possibilities for text analysis, especially for lengthy documents like privacy policies. in this study, we investigate a privacy policy text analysis framework policygpt based on the llm. this framework was tested using two datasets. the first dataset comprises of privacy policies from 115 websites, which were meticulously annotated by legal experts, categorizing each segment into one of 10 classes. the second dataset consists of privacy policies from 304 popular mobile applications, with each sentence manually annotated and classified into one of another 10 categories. under zero-shot learning conditions, policygpt demonstrated robust performance. for the first dataset, it achieved an accuracy rate of 97%, while for the second dataset, it attained an 87% accuracy rate, surpassing that of the baseline machine learning and neural network models.
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
Abstract: large language models (llms) have recently experienced tremendous popularity and are widely used from casual conversations to ai-driven programming. however, despite their considerable success, llms are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. while safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit llms to produce harmful content. these jailbreak templates are typically manually crafted, making large-scale testing challenging. in this paper, we introduce gptfuzz, a novel black-box jailbreak fuzzing framework inspired by the afl fuzzing framework. instead of manual engineering, gptfuzz automates the generation of jailbreak templates for red-teaming llms. at its core, gptfuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. we detail three key components of gptfuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. we evaluate gptfuzz against various commercial and open-source llms, including chatgpt, llama-2, and vicuna, under diverse attack scenarios. our results indicate that gptfuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. remarkably, gptfuzz achieves over 90% attack success rates against chatgpt and llama-2 models, even with suboptimal initial seed templates. we anticipate that gptfuzz will be instrumental for researchers and practitioners in examining llm robustness and will encourage further exploration into enhancing llm safety.
Umar Iqbal, Tadayoshi Kohno, Franziska Roesner
Abstract: large language model (llm) platforms, such as chatgpt, have recently begun offering a plugin ecosystem to interface with third-party services on the internet. while these plugins extend the capabilities of llm platforms, they are developed by arbitrary third parties and thus cannot be implicitly trusted. plugins also interface with llm platforms and users using natural language, which can have imprecise interpretations. in this paper, we propose a framework that lays a foundation for llm platform designers to analyze and improve the security, privacy, and safety of current and future plugin-integrated llm platforms. our framework is a formulation of an attack taxonomy that is developed by iteratively exploring how llm platform stakeholders could leverage their capabilities and responsibilities to mount attacks against each other. as part of our iterative process, we apply our framework in the context of openai's plugin ecosystem. we uncover plugins that concretely demonstrate the potential for the types of issues that we outline in our attack taxonomy. we conclude by discussing novel challenges and by providing recommendations to improve the security, privacy, and safety of present and future llm-based computing platforms.

2023-09-17

Rojin Ziaei, Samuel Schmidgall
Abstract: large language models (llms) are becoming increasingly relevant as a potential tool for healthcare, aiding communication between clinicians, researchers, and patients. however, traditional evaluations of llms on medical exam questions do not reflect the complexity of real patient-doctor interactions. an example of this complexity is the introduction of patient self-diagnosis, where a patient attempts to diagnose their own medical conditions from various sources. while the patient sometimes arrives at an accurate conclusion, they more often are led toward misdiagnosis due to the patient's over-emphasis on bias validating information. in this work we present a variety of llms with multiple-choice questions from united states medical board exams which are modified to include self-diagnostic reports from patients. our findings highlight that when a patient proposes incorrect bias-validating information, the diagnostic accuracy of llms drop dramatically, revealing a high susceptibility to errors in self-diagnosis.
Guido Zuccon, Bevan Koopman, Razia Shaik
Abstract: can chatgpt provide evidence to support its answers? does the evidence it suggests actually exist and does it really support its answer? we investigate these questions using a collection of domain-specific knowledge-based questions, specifically prompting chatgpt to provide both an answer and supporting evidence in the form of references to external sources. we also investigate how different prompts impact answers and evidence. we find that chatgpt provides correct or partially correct answers in about half of the cases (50.6% of the times), but its suggested references only exist 14% of the times. we further provide insights on the generated references that reveal common traits among the references that chatgpt generates, and show how even if a reference provided by the model does exist, this reference often does not support the claims chatgpt attributes to it. our findings are important because (1) they are the first systematic analysis of the references created by chatgpt in its answers; (2) they suggest that the model may leverage good quality information in producing correct answers, but is unable to attribute real evidence to support its answers. prompts, raw result files and manual analysis are made publicly available.
Bochuan Cao, Yuanpu Cao, Lu Lin, Jinghui Chen
Abstract: recently, large language models (llms) have made significant advancements and are now widely used across various domains. unfortunately, there has been a rising concern that llms can be misused to generate harmful or malicious content. though a line of research has focused on aligning llms with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. in this work, we introduce a robustly aligned llm (ra-llm) to defend against potential alignment-breaking attacks. ra-llm can be directly constructed upon an existing aligned llm with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original llm. furthermore, we also provide a theoretical analysis for ra-llm to verify its effectiveness in defending against alignment-breaking attacks. through real-world experiments on open-source large language models, we demonstrate that ra-llm can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100\% to around 10\% or less.

2023-09-16

Mahammed Kamruzzaman, Md. Minul Islam Shovon, Gene Louis Kim
Abstract: llms are increasingly powerful and widely used to assist users in a variety of tasks. this use risks the introduction of llm biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. bias in nlp systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., asians are good at math). in this paper, we investigate bias along less studied, but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that llms (specially autoregressive language models) make between social groups and unrelated positive and negative attributes. we ask whether llms hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the ``what is beautiful is good'' bias found in people in experimental psychology. we introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. we also reverse the completion task to select the social group based on an attribute. finally, we report the correlations that we find for multiple cutting-edge llms. this dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.
Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki
Abstract: pre-trained language models trained on large-scale data have learned serious levels of social biases. consequently, various methods have been proposed to debias pre-trained models. debiasing methods need to mitigate only discriminatory bias information from the pre-trained models, while retaining information that is useful for the downstream tasks. in previous research, whether useful information is retained has been confirmed by the performance of downstream tasks in debiased pre-trained models. on the other hand, it is not clear whether these benchmarks consist of data pertaining to social biases and are appropriate for investigating the impact of debiasing. for example in gender-related social biases, data containing female words (e.g. ``she, female, woman''), male words (e.g. ``he, male, man''), and stereotypical words (e.g. ``nurse, doctor, professor'') are considered to be the most affected by debiasing. if there is not much data containing these words in a benchmark dataset for a target task, there is the possibility of erroneously evaluating the effects of debiasing. in this study, we compare the impact of debiasing on performance across multiple downstream tasks using a wide-range of benchmark datasets that containing female, male, and stereotypical words. experiments show that the effects of debiasing are consistently \emph{underestimated} across all tasks. moreover, the effects of debiasing could be reliably evaluated by separately considering instances containing female, male, and stereotypical words than all of the instances in a benchmark dataset.
Kyrie Zhixuan Zhou, Madelyn Rose Sanfilippo
Abstract: large language models are quickly gaining momentum, yet are found to demonstrate gender bias in their responses. in this paper, we conducted a content analysis of social media discussions to gauge public perceptions of gender bias in llms which are trained in different cultural contexts, i.e., chatgpt, a us-based llm, or ernie, a china-based llm. people shared both observations of gender bias in their personal use and scientific findings about gender bias in llms. a difference between the two llms was seen -- chatgpt was more often found to carry implicit gender bias, e.g., associating men and women with different profession titles, while explicit gender bias was found in ernie's responses, e.g., overly promoting women's pursuit of marriage over career. based on the findings, we reflect on the impact of culture on gender bias and propose governance recommendations to regulate gender bias in llms.

2023-09-15

Khyati Khandelwal, Manuel Tonneau, Andrew M. Bean, Hannah Rose Kirk, Scott A. Hale
Abstract: large language models (llms), now used daily by millions of users, can encode societal biases, exposing their users to representational harms. a large body of scholarship on llm bias exists but it predominantly adopts a western-centric frame and attends comparatively less to bias levels and potential harms in the global south. in this paper, we quantify stereotypical bias in popular llms according to an indian-centric frame and compare bias levels between the indian and western contexts. to do this, we develop a novel dataset which we call indian-bhed (indian bias evaluation dataset), containing stereotypical and anti-stereotypical examples for caste and religion contexts. we find that the majority of llms tested are strongly biased towards stereotypes in the indian context, especially as compared to the western context. we finally investigate instruction prompting as a simple intervention to mitigate such bias and find that it significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for gpt-3.5. the findings of this work highlight the need for including more diverse voices when evaluating llms.
Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, Preslav Nakov
Abstract: the spread of fake news has emerged as a critical challenge, undermining trust and posing threats to society. in the era of large language models (llms), the capability to generate believable fake content has intensified these concerns. in this study, we present a novel paradigm to evaluate fake news detectors in scenarios involving both human-written and llm-generated misinformation. intriguingly, our findings reveal a significant bias in many existing detectors: they are more prone to flagging llm-generated content as fake news while often misclassifying human-written fake news as genuine. this unexpected bias appears to arise from distinct linguistic patterns inherent to llm outputs. to address this, we introduce a mitigation strategy that leverages adversarial training with llm-paraphrased genuine news. the resulting model yielded marked improvements in detection accuracy for both human and llm-generated news. to further catalyze research in this domain, we release two comprehensive datasets, \texttt{gossipcop++} and \texttt{politifact++}, thus amalgamating human-validated articles with llm-generated fake and real news.
Jintang Xue, Yun-Cheng Wang, Chengwei Wei, Xiaofeng Liu, Jonghye Woo, C. -C. Jay Kuo
Abstract: chatbots have been studied for more than half a century. with the rapid development of natural language processing (nlp) technologies in recent years, chatbots using large language models (llms) have received much attention nowadays. compared with traditional ones, modern chatbots are more powerful and have been used in real-world applications. there are however, bias and fairness concerns in modern chatbot design. due to the huge amounts of training data, extremely large model sizes, and lack of interpretability, bias mitigation and fairness preservation of modern chatbots are challenging. thus, a comprehensive overview on bias and fairness in chatbot systems is given in this paper. the history of chatbots and their categories are first reviewed. then, bias sources and potential harms in applications are analyzed. considerations in designing fair and unbiased chatbot systems are examined. finally, future research directions are discussed.

2023-09-14

João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
Abstract: credibility signals represent a wide range of heuristics that are typically used by journalists and fact-checkers to assess the veracity of online content. automating the task of credibility signal extraction, however, is very challenging as it requires high-accuracy signal-specific extractors to be trained, while there are currently no sufficiently large datasets annotated with all credibility signals. this paper investigates whether large language models (llms) can be prompted effectively with a set of 18 credibility signals to produce weak labels for each signal. we then aggregate these potentially noisy labels using weak supervision in order to predict content veracity. we demonstrate that our approach, which combines zero-shot llm credibility signal labeling and weak supervision, outperforms state-of-the-art classifiers on two misinformation datasets without using any ground-truth labels for training. we also analyse the contribution of the individual credibility signals towards predicting content veracity, which provides new valuable insights into their role in misinformation detection.
Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou
Abstract: training large language models to follow instructions makes them perform better on a wide range of tasks, generally becoming more helpful. however, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. in this paper, we raise concerns over the safety of models that only emphasize helpfulness, not safety, in their instruction-tuning. we show that several popular instruction-tuned models are highly unsafe. moreover, we show that adding just 3% safety examples (a few hundred demonstrations) in the training set when fine-tuning a model like llama can substantially improve their safety. our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. however, we do find a behavior of exaggerated safety, where too much safety-tuning makes models refuse to respond to reasonable prompts that superficially resemble unsafe ones. our study sheds light on trade-offs in training llms to follow instructions and exhibit safe behavior.
Julius Steen, Katja Markert
Abstract: summarization is an important application of large language models (llms). most previous evaluation of summarization models has focused on their performance in content selection, grammaticality and coherence. however, it is well known that llms reproduce and reinforce harmful social biases. this raises the question: do these biases affect model outputs in a relatively constrained setting like summarization? to help answer this question, we first motivate and introduce a number of definitions for biased behaviours in summarization models, along with practical measures to quantify them. since we find biases inherent to the input document can confound our analysis, we additionally propose a method to generate input documents with carefully controlled demographic attributes. this allows us to sidestep this issue, while still working with somewhat realistic input documents. finally, we apply our measures to summaries generated by both purpose-built summarization models and general purpose chat models. we find that content selection in single document summarization seems to be largely unaffected by bias, while hallucinations exhibit evidence of biases propagating to generated summaries.
Kyrie Zhixuan Zhou, Jiaxun Cao, Xiaowen Yuan, Daniel E. Weissglass, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Xin Tong
Abstract: gender bias is rampant in ai systems, causing bad user experience, injustices, and mental harm to women. school curricula fail to educate ai creators on this topic, leaving them unprepared to mitigate gender bias in ai. in this paper, we designed hands-on tutorials to raise ai creators' awareness of gender bias in ai and enhance their knowledge of sources of gender bias and debiasing techniques. the tutorials were evaluated with 18 ai creators, including ai researchers, ai industrial practitioners (i.e., developers and product managers), and students who had learned ai. their improved awareness and knowledge demonstrated the effectiveness of our tutorials, which have the potential to complement the insufficient ai gender bias education in cs/ai courses. based on the findings, we synthesize design implications and a rubric to guide future research, education, and design efforts.

2023-09-13

Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, Weiqiang Jia
Abstract: as large language models continue to develop in the field of ai, text generation systems are susceptible to a worrisome phenomenon known as hallucination. in this study, we summarize recent compelling insights into hallucinations in llms. we present a novel taxonomy of hallucinations from various text generation tasks, thus provide theoretical insights, detection methods and improvement approaches. based on this, future research directions are proposed. our contribution are threefold: (1) we provide a detailed and complete taxonomy for hallucinations appearing in text generation tasks; (2) we provide theoretical analyses of hallucinations in llms and provide existing detection and improvement methods; (3) we propose several research directions that can be developed in the future. as hallucinations garner significant attention from the community, we will maintain updates on relevant research progress.
Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, Minlie Huang
Abstract: with the rapid development of large language models (llms), increasing attention has been paid to their safety concerns. consequently, evaluating the safety of llms has become an essential task for facilitating the broad applications of llms. nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of llms. in this work, we present safetybench, a comprehensive benchmark for evaluating the safety of llms, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. notably, safetybench also incorporates both chinese and english data, facilitating the evaluation in both languages. our extensive tests over 25 popular chinese and english llms in both zero-shot and few-shot settings reveal a substantial performance advantage for gpt-4 over its counterparts, and there is still significant room for improving the safety of current llms. we believe safetybench will enable fast and comprehensive evaluation of llms' safety, and foster the development of safer llms. data and evaluation guidelines are available at https://github.com/thu-coai/safetybench. submission entrance and leaderboard are available at https://llmbench.ai/safety.
Haoqin Tu, Bingchen Zhao, Chen Wei, Cihang Xie
Abstract: multi-modal large language models (mllms) are trained based on large language models (llm), with an enhanced capability to comprehend multi-modal inputs and generate textual responses. while they excel in multi-modal tasks, the pure nlp abilities of mllms are often underestimated and left untested. in this study, we get out of the box and unveil an intriguing characteristic of mllms -- our preliminary results suggest that visual instruction tuning, a prevailing strategy for transitioning llms into mllms, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment in the pure nlp context. for example, a visual-instruction-tuned llama2 7b model surpasses the performance of the llama2-chat 7b model, fine-tuned with over one million human annotations, on truthfulqa-mc and ethics benchmarks. further analysis reveals that the improved alignment can be attributed to the superior instruction quality inherent to visual-text data. in releasing our code at github.com/ucsc-vlaa/sight-beyond-text, we aspire to foster further exploration into the intrinsic value of visual-text synergies and, in a broader scope, multi-modal interactions in alignment research.
Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang
Abstract: large language models (llms) often demonstrate inconsistencies with human preferences. previous research typically gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, a.k.a. the finetuning step. in contrast, aligning frozen llms without requiring alignment data is more appealing. this work explores the potential of the latter setting. we discover that by integrating self-evaluation and rewind mechanisms, unaligned llms can directly produce responses consistent with human preferences via self-boosting. we introduce a novel inference method, rewindable auto-regressive inference (rain), that allows pre-trained llms to evaluate their own generation and use the evaluation results to guide rewind and generation for ai safety. notably, rain operates without the need of extra data for model alignment and abstains from any training, gradient computation, or parameter updates. experimental results evaluated by gpt-4 and humans demonstrate the effectiveness of rain: on the hh dataset, rain improves the harmlessness rate of llama 30b from 82% of vanilla inference to 97%, while maintaining the helpfulness rate. on the truthfulqa dataset, rain improves the truthfulness of the already-well-aligned llama-2-chat 13b model by 5%.
Daisuke Oba, Masahiro Kaneko, Danushka Bollegala
Abstract: despite their impressive performance in a wide range of nlp tasks, large language models (llms) have been reported to encode worrying-levels of gender bias. prior work has proposed debiasing methods that require human labelled examples, data augmentation and fine-tuning of the llms, which are computationally costly. moreover, one might not even have access to the internal parameters for performing debiasing such as in the case of commercially available llms such as gpt-4. to address this challenge we propose bias suppression, a novel alternative to debiasing that does not require access to model parameters. we show that text-based preambles, generated from manually designed templates covering counterfactual statements, can accurately suppress gender biases in llms. moreover, we find that descriptive sentences for occupations can further suppress gender biases. interestingly, we find that bias suppression has a minimal adverse effect on downstream task performance, while effectively mitigating the gender biases.

2023-09-12

Kazuhiro Takemoto
Abstract: as large language models (llms) become more deeply integrated into various sectors, understanding how they make moral judgments has become crucial, particularly in the realm of autonomous driving. this study utilized the moral machine framework to investigate the ethical decision-making tendencies of prominent llms, including gpt-3.5, gpt-4, palm 2, and llama 2, comparing their responses to human preferences. while llms' and humans' preferences such as prioritizing humans over pets and favoring saving more lives are broadly aligned, palm 2 and llama 2, especially, evidence distinct deviations. additionally, despite the qualitative similarities between the llm and human preferences, there are significant quantitative disparities, suggesting that llms might lean toward more uncompromising decisions, compared to the milder inclinations of humans. these insights elucidate the ethical frameworks of llms and their potential implications for autonomous driving.
Maximilian Li, Xander Davies, Max Nadeau
Abstract: language models often exhibit behaviors that improve performance on a pre-training objective but harm performance on downstream tasks. we propose a novel approach to removing undesirable behaviors by ablating a small number of causal pathways between model components, with the intention of disabling the computational circuit responsible for the bad behavior. given a small dataset of inputs where the model behaves poorly, we learn to ablate a small number of important causal pathways. in the setting of reducing gpt-2 toxic language generation, we find ablating just 12 of the 11.6k causal edges mitigates toxic generation with minimal degradation of performance on other inputs.
Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
Abstract: model adaptation is crucial to handle the discrepancy between proxy training data and actual users data received. to effectively perform adaptation, textual data of users is typically stored on servers or their local devices, where downstream natural language processing (nlp) models can be directly trained using such in-domain data. however, this might raise privacy and security concerns due to the extra risks of exposing user information to adversaries. replacing identifying information in textual data with a generic marker has been recently explored. in this work, we leverage large language models (llms) to suggest substitutes of masked tokens and have their effectiveness evaluated on downstream language modeling tasks. specifically, we propose multiple pre-trained and fine-tuned llm-based approaches and perform empirical studies on various datasets for the comparison of these methods. experimental results show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data without privacy-preserving token masking.

2023-09-11

Md Abdul Aowal, Maliha T Islam, Priyanka Mary Mammen, Sandesh Shetty
Abstract: in this project, we want to explore the newly emerging field of prompt engineering and apply it to the downstream task of detecting lm biases. more concretely, we explore how to design prompts that can indicate 4 different types of biases: (1) gender, (2) race, (3) sexual orientation, and (4) religion-based. within our project, we experiment with different manually crafted prompts that can draw out the subtle biases that may be present in the language model. we apply these prompts to multiple variations of popular and well-recognized models: bert, roberta, and t5 to evaluate their biases. we provide a comparative analysis of these models and assess them using a two-fold method: use human judgment to decide whether model predictions are biased and utilize model-level judgment (through further prompts) to understand if a model can self-diagnose the biases of its own prediction.
Dongyu Yao, Jianshu Zhang, Ian G. Harris, Marcel Carlsson
Abstract: jailbreak vulnerabilities in large language models (llms), which exploit meticulously crafted prompts to elicit content that violates service guidelines, have captured the attention of research communities. while model owners can defend against individual jailbreak prompts through safety training strategies, this relatively passive approach struggles to handle the broader category of similar jailbreaks. to tackle this issue, we introduce fuzzllm, an automated fuzzing framework designed to proactively test and discover jailbreak vulnerabilities in llms. we utilize templates to capture the structural integrity of a prompt and isolate key features of a jailbreak class as constraints. by integrating different base classes into powerful combo attacks and varying the elements of constraints and prohibited questions, fuzzllm enables efficient testing with reduced manual effort. extensive experiments demonstrate fuzzllm's effectiveness and comprehensiveness in vulnerability discovery across various llms.
Edoardo Debenedetti, Giorgio Severi, Nicholas Carlini, Christopher A. Choquette-Choo, Matthew Jagielski, Milad Nasr, Eric Wallace, Florian Tramèr
Abstract: most current approaches for protecting privacy in machine learning (ml) assume that models exist in a vacuum, when in reality, ml models are part of larger systems that include components for training data filtering, output monitoring, and more. in this work, we introduce privacy side channels: attacks that exploit these system-level components to extract private information at far higher rates than is otherwise possible for standalone models. we propose four categories of side channels that span the entire ml lifecycle (training data filtering, input preprocessing, output post-processing, and query filtering) and allow for either enhanced membership inference attacks or even novel threats such as extracting users' test queries. for example, we show that deduplicating training data before applying differentially-private training creates a side-channel that completely invalidates any provable privacy guarantees. moreover, we show that systems which block language models from regenerating training data can be exploited to allow exact reconstruction of private keys contained in the training set -- even if the model did not memorize these keys. taken together, our results demonstrate the need for a holistic, end-to-end privacy analysis of machine learning.
Vipula Rawte, Amit Sheth, Amitava Das
Abstract: hallucination in a foundation model (fm) refers to the generation of content that strays from factual reality or includes fabricated information. this survey paper provides an extensive overview of recent efforts that aim to identify, elucidate, and tackle the problem of hallucination, with a particular focus on ``large'' foundation models (lfms). the paper classifies various types of hallucination phenomena that are specific to lfms and establishes evaluation criteria for assessing the extent of hallucination. it also examines existing strategies for mitigating hallucination in lfms and discusses potential directions for future research in this area. essentially, the paper offers a comprehensive examination of the challenges and solutions related to hallucination in lfms.

2023-09-10

Li Du, Yequan Wang, Xingrun Xing, Yiqun Ya, Xiang Li, Xin Jiang, Xuezhi Fang
Abstract: although demonstrating superb performance on various nlp tasks, large language models (llms) still suffer from the hallucination problem, which threatens the reliability of llms. to measure the level of hallucination of llms, previous works first categorize the hallucination according to the phenomenon similarity, then quantify the proportion that model outputs contain hallucinatory contents. however, such hallucination rates could easily be distorted by confounders. moreover, such hallucination rates could not reflect the reasons for the hallucination, as similar hallucinatory phenomena may originate from different sources. to address these issues, we propose to combine the hallucination level quantification and hallucination reason investigation through an association analysis, which builds the relationship between the hallucination rate of llms with a set of risk factors. in this way, we are able to observe the hallucination level under each value of each risk factor, examining the contribution and statistical significance of each risk factor, meanwhile excluding the confounding effect of other factors. additionally, by recognizing the risk factors according to a taxonomy of model capability, we reveal a set of potential deficiencies in commonsense memorization, relational reasoning, and instruction following, which may further provide guidance for the pretraining and supervised fine-tuning process of llms to mitigate the hallucination.

2023-09-08

Dongyub Lee, Taesun Whang, Chanhee Lee, Heuiseok Lim
Abstract: large language models (llms) have emerged as versatile tools in various daily applications. however, they are fraught with issues that undermine their utility and trustworthiness. these include the incorporation of erroneous references (citation), the generation of hallucinated information (correctness), and the inclusion of superfluous or omission of crucial details (fluency). to ameliorate these concerns, this study makes several key contributions. first, we build a dataset to train a critic model capable of evaluating the citation, correctness, and fluency of responses generated by llms in qa systems. second, we propose an automated feedback mechanism that leverages the critic model to offer real-time feedback on heterogeneous aspects of generated text. third, we introduce a feedback learning loop that uses this critic model to iteratively improve the performance of the llm responsible for response generation. experimental results demonstrate the efficacy of our approach, showing substantial improvements in citation and fluency metrics for chatgpt, including a 4% precision increase in citation and an approximately 8% enhancement in the mauve metric for fluency, while maintaining high levels of correctness.

2023-09-07

Hongzhi Qi, Qing Zhao, Changwei Song, Wei Zhai, Dan Luo, Shuo Liu, Yi Jing Yu, Fan Wang, Huijing Zou, Bing Xiang Yang, Jianqiang Li, Guanghui Fu
Abstract: large language models, particularly those akin to the rapidly progressing gpt series, are gaining traction for their expansive influence. while there is keen interest in their applicability within medical domains such as psychology, tangible explorations on real-world data remain scant. concurrently, users on social media platforms are increasingly vocalizing personal sentiments; under specific thematic umbrellas, these sentiments often manifest as negative emotions, sometimes escalating to suicidal inclinations. timely discernment of such cognitive distortions and suicidal risks is crucial to effectively intervene and potentially avert dire circumstances. our study ventured into this realm by experimenting on two pivotal tasks: suicidal risk and cognitive distortion identification on chinese social media platforms. using supervised learning as a baseline, we examined and contrasted the efficacy of large language models via three distinct strategies: zero-shot, few-shot, and fine-tuning. our findings revealed a discernible performance gap between the large language models and traditional supervised learning approaches, primarily attributed to the models' inability to fully grasp subtle categories. notably, while gpt-4 outperforms its counterparts in multiple scenarios, gpt-3.5 shows significant enhancement in suicide risk classification after fine-tuning. to our knowledge, this investigation stands as the maiden attempt at gauging large language models on chinese social media tasks. this study underscores the forward-looking and transformative implications of using large language models in the field of psychology. it lays the groundwork for future applications in psychological research and practice.
Patrick Haller, Ansar Aynetdinov, Alan Akbik
Abstract: instruction-tuned large language models (llms) have recently showcased remarkable ability to generate fitting responses to natural language instructions. however, an open research question concerns the inherent biases of trained models and their responses. for instance, if the data used to tune an llm is dominantly written by persons with a specific political bias, we might expect generated answers to share this bias. current research work seeks to de-bias such models, or suppress potentially biased answers. with this demonstration, we take a different view on biases in instruction-tuning: rather than aiming to suppress them, we aim to make them explicit and transparent. to this end, we present opiniongpt, a web demo in which users can ask questions and select all biases they wish to investigate. the demo will answer this question using a model fine-tuned on text representing each of the selected biases, allowing side-by-side comparison. to train the underlying model, we identified 11 different biases (political, geographic, gender, age) and derived an instruction-tuning corpus in which each answer was written by members of one of these demographics. this paper presents opiniongpt, illustrates how we trained the bias-aware model and showcases the web application (available at https://opiniongpt.informatik.hu-berlin.de).
Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He
Abstract: despite their impressive capabilities, large language models (llms) are prone to hallucinations, i.e., generating content that deviates from facts seen during pretraining. we propose a simple decoding strategy for reducing hallucinations with pretrained llms that does not require conditioning on retrieved external knowledge nor additional fine-tuning. our approach obtains the next-token distribution by contrasting the differences in logits obtained from projecting the later layers versus earlier layers to the vocabulary space, exploiting the fact that factual knowledge in an llms has generally been shown to be localized to particular transformer layers. we find that this decoding by contrasting layers (dola) approach is able to better surface factual knowledge and reduce the generation of incorrect facts. dola consistently improves the truthfulness across multiple choices tasks and open-ended generation tasks, for example improving the performance of llama family models on truthfulqa by 12-17% absolute points, demonstrating its potential in making llms reliably generate truthful facts.
Emmanuel Klu, Sameer Sethi
Abstract: machine learning models can perpetuate unintended biases from unfair and imbalanced datasets. evaluating and debiasing these datasets and models is especially hard in text datasets where sensitive attributes such as race, gender, and sexual orientation may not be available. when these models are deployed into society, they can lead to unfair outcomes for historically underrepresented groups. in this paper, we present a dataset coupled with an approach to improve text fairness in classifiers and language models. we create a new, more comprehensive identity lexicon, tidal, which includes 15,123 identity terms and associated sense context across three demographic categories. we leverage tidal to develop an identity annotation and augmentation tool that can be used to improve the availability of identity context and the effectiveness of ml fairness techniques. we evaluate our approaches using human contributors, and additionally run experiments focused on dataset and model debiasing. results show our assistive annotation technique improves the reliability and velocity of human-in-the-loop processes. our dataset and methods uncover more disparities during evaluation, and also produce more fair models during remediation. these approaches provide a practical path forward for scaling classifier and generative model fairness in real-world settings.
Adel Khorramrouz, Sujan Dutta, Arka Dutta, Ashiqur R. Khudabukhsh
Abstract: this paper conducts a robustness audit of the safety feedback of palm 2 through a novel toxicity rabbit hole framework introduced here. starting with a stereotype, the framework instructs palm 2 to generate more toxic content than the stereotype. every subsequent iteration it continues instructing palm 2 to generate more toxic content than the previous iteration until palm 2 safety guardrails throw a safety violation. our experiments uncover highly disturbing antisemitic, islamophobic, racist, homophobic, and misogynistic (to list a few) generated content that palm 2 safety guardrails do not evaluate as highly unsafe.

2023-09-06

Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, Hima Lakkaraju
Abstract: large language models (llms) released for public use incorporate guardrails to ensure their output is safe, often referred to as "model alignment." an aligned language model should decline a user's request to produce harmful content. however, such safety measures are vulnerable to adversarial prompts, which contain maliciously designed token sequences to circumvent the model's safety guards and cause it to produce harmful content. in this work, we introduce erase-and-check, the first framework to defend against adversarial prompts with verifiable safety guarantees. we erase tokens individually and inspect the resulting subsequences using a safety filter. our procedure labels the input prompt as harmful if any subsequences or the input prompt are detected as harmful by the filter. this guarantees that any adversarial modification of a harmful prompt up to a certain size is also labeled harmful. we defend against three attack modes: i) adversarial suffix, which appends an adversarial sequence at the end of the prompt; ii) adversarial insertion, where the adversarial sequence is inserted anywhere in the middle of the prompt; and iii) adversarial infusion, where adversarial tokens are inserted at arbitrary positions in the prompt, not necessarily as a contiguous block. empirical results demonstrate that our technique obtains strong certified safety guarantees on harmful prompts while maintaining good performance on safe prompts. for example, against adversarial suffixes of length 20, it certifiably detects 93% of the harmful prompts and labels 94% of the safe prompts as safe using the open source language model llama 2 as the safety filter.
Supun Manathunga, Isuru Hettigoda
Abstract: large language models (llms) have demonstrated remarkable adaptability, showcasing their capacity to excel in tasks for which they were not explicitly trained. however, despite their impressive natural language processing (nlp) capabilities, effective alignment of llms remains a crucial challenge when deploying them for specific clinical applications. the ability to generate responses with factually accurate content and to engage in non-trivial reasoning steps are crucial for the llms to be eligible for applications in clinical medicine. employing a combination of techniques including instruction-tuning and in-prompt strategies like few-shot and chain-of-thought prompting has significantly enhanced the performance of llms. our proposed alignment strategy for medical question-answering, known as 'expand-guess-refine', offers a parameter and data-efficient solution. a preliminary analysis of this method demonstrated outstanding performance, achieving a score of 70.63% on a subset of questions sourced from the usmle dataset.
Tong Liu, Zizhuang Deng, Guozhu Meng, Yuekang Li, Kai Chen
Abstract: in recent years, large language models (llms) have demonstrated remarkable potential across various downstream tasks. llm-integrated frameworks, which serve as the essential infrastructure, have given rise to many llm-integrated web apps. however, some of these frameworks suffer from remote code execution (rce) vulnerabilities, allowing attackers to execute arbitrary code on apps' servers remotely via prompt injections. despite the severity of these vulnerabilities, no existing work has been conducted for a systematic investigation of them. this leaves a great challenge on how to detect vulnerabilities in frameworks as well as llm-integrated apps in real-world scenarios. to fill this gap, we present two novel strategies, including 1) a static analysis-based tool called llmsmith to scan the source code of the framework to detect potential rce vulnerabilities and 2) a prompt-based automated testing approach to verify the vulnerability in llm-integrated web apps. we discovered 13 vulnerabilities in 6 frameworks, including 12 rce vulnerabilities and 1 arbitrary file read/write vulnerability. 11 of them are confirmed by the framework developers, resulting in the assignment of 7 cve ids. after testing 51 apps, we found vulnerabilities in 17 apps, 16 of which are vulnerable to rce and 1 to sql injection. we responsibly reported all 17 issues to the corresponding developers and received acknowledgments. furthermore, we amplify the attack impact beyond achieving rce by allowing attackers to exploit other app users (e.g. app responses hijacking, user api key leakage) without direct interaction between the attacker and the victim. lastly, we propose some mitigating strategies for improving the security awareness of both framework and app developers, helping them to mitigate these risks effectively.
Yu Chen, Tingxin Li, Huiming Liu, Yang Yu
Abstract: numerous companies have started offering services based on large language models (llm), such as chatgpt, which inevitably raises privacy concerns as users' prompts are exposed to the model provider. previous research on secure reasoning using multi-party computation (mpc) has proven to be impractical for llm applications due to its time-consuming and communication-intensive nature. while lightweight anonymization techniques can protect private information in prompts through substitution or masking, they fail to recover sensitive data replaced in the llm-generated results. in this paper, we expand the application scenarios of anonymization techniques by training a small local model to de-anonymize the llm's returned results with minimal computational overhead. we introduce the has framework, where "h(ide)" and "s(eek)" represent its two core processes: hiding private entities for anonymization and seeking private entities for de-anonymization, respectively. to quantitatively assess has's privacy protection performance, we propose both black-box and white-box adversarial models. furthermore, we conduct experiments to evaluate has's usability in translation and classification tasks. the experimental findings demonstrate that the has framework achieves an optimal balance between privacy protection and utility.
Pengyu Cheng, Jiawen Xie, Ke Bai, Yong Dai, Nan Du
Abstract: reward models (rms) are essential for aligning large language models (llms) with human preferences to improve interaction quality. however, the real world is pluralistic, which leads to diversified human preferences with respect to different religions, politics, cultures, etc. moreover, each individual can have their unique preferences on various topics. neglecting the diversity of human preferences, current human feedback aligning methods only consider a general reward model, which is below satisfaction for customized or personalized application scenarios. to explore customized preference learning, we collect a domain-specific preference (dsp) dataset, which includes preferred responses for each given query from four practical domains. besides, from the perspective of data efficiency, we propose a three-stage customized rm learning scheme, then empirically verify its effectiveness on both general preference datasets and our dsp set. furthermore, we test multiple training and data strategies on the three learning stages. we find several ways to better preserve the general preferring ability while training the customized rms, especially general preference enrichment, and customized preference imitation learning. the dsp dataset and code are available at https://github.com/linear95/dsp.
Tharindu Kumarage, Amrita Bhattacharjee, Djordje Padejski, Kristy Roschke, Dan Gillmor, Scott Ruston, Huan Liu, Joshua Garland
Abstract: the rapid proliferation of ai-generated text online is profoundly reshaping the information landscape. among various types of ai-generated text, ai-generated news presents a significant threat as it can be a prominent source of misinformation online. while several recent efforts have focused on detecting ai-generated text in general, these methods require enhanced reliability, given concerns about their vulnerability to simple adversarial attacks. furthermore, due to the eccentricities of news writing, applying these detection methods for ai-generated news can produce false positives, potentially damaging the reputation of news organizations. to address these challenges, we leverage the expertise of an interdisciplinary team to develop a framework, j-guard, capable of steering existing supervised ai text detectors for detecting ai-generated news while boosting adversarial robustness. by incorporating stylistic cues inspired by the unique journalistic attributes, j-guard effectively distinguishes between real-world journalism and ai-generated news articles. our experiments on news articles generated by a vast array of ai models, including chatgpt (gpt3.5), demonstrate the effectiveness of j-guard in enhancing detection capabilities while maintaining an average performance decrease of as low as 7% when faced with adversarial attacks.

2023-09-05

Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, Zhifang Sui
Abstract: reasoning is a cognitive process of using evidence to reach a sound conclusion. the reasoning capability is essential for large language models (llms) to serve as the brain of the artificial general intelligence agent. recent studies reveal that fine-tuning llms on data with the chain of thought (cot) reasoning process can significantly enhance their reasoning capabilities. however, we find that the fine-tuned llms suffer from an \textit{assessment misalignment} problem, i.e., they frequently assign higher scores to subpar cots, leading to potential limitations in their reasoning abilities. to address this problem, we introduce an \textit{alignment fine-tuning (aft)} paradigm, which involves three steps: 1) fine-tuning llms with cot training data; 2) generating multiple cot responses for each question, and categorizing them into positive and negative ones based on whether they achieve the correct answer; 3) calibrating the scores of positive and negative responses given by llms with a novel constraint alignment loss. specifically, the constraint alignment loss has two objectives: a) alignment, which guarantees that positive scores surpass negative scores to encourage answers with high-quality cots; b) constraint, which keeps the negative scores confined to a reasonable range to prevent the model degradation. beyond just the binary positive and negative feedback, the constraint alignment loss can be seamlessly adapted to the ranking situations when ranking feedback is accessible. furthermore, we also delve deeply into recent ranking-based alignment methods, such as dpo, rrhf, and pro, and discover that the constraint, which has been overlooked by these approaches, is also crucial for their performance. extensive experiments on four reasoning benchmarks with both binary and ranking feedback demonstrate the effectiveness of aft.
Helena Bonaldi, Giuseppe Attanasio, Debora Nozza, Marco Guerini
Abstract: recent computational approaches for combating online hate speech involve the automatic generation of counter narratives by adapting pretrained transformer-based language models (plms) with human-curated data. this process, however, can produce in-domain overfitting, resulting in models generating acceptable narratives only for hatred similar to training data, with little portability to other targets or to real-world toxic language. this paper introduces novel attention regularization methodologies to improve the generalization capabilities of plms for counter narratives generation. overfitting to training-specific terms is then discouraged, resulting in more diverse and richer narratives. we experiment with two attention-based regularization techniques on a benchmark english dataset. regularized models produce better counter narratives than state-of-the-art approaches in most cases, both in terms of automatic metrics and human evaluation, especially when hateful targets are not present in the training data. this work paves the way for better and more flexible counter-speech generation models, a task for which datasets are highly challenging to produce.
Martin Huschens, Martin Briesch, Dominik Sobania, Franz Rothlauf
Abstract: this paper examines how individuals perceive the credibility of content originating from human authors versus content generated by large language models, like the gpt language model family that powers chatgpt, in different user interface versions. surprisingly, our results demonstrate that regardless of the user interface presentation, participants tend to attribute similar levels of credibility. while participants also do not report any different perceptions of competence and trustworthiness between human and ai-generated content, they rate ai-generated content as being clearer and more engaging. the findings from this study serve as a call for a more discerning approach to evaluating information sources, encouraging users to exercise caution and critical thinking when engaging with content generated by ai systems.
Junyu Luo, Cao Xiao, Fenglong Ma
Abstract: the prevalent use of large language models (llms) in various domains has drawn attention to the issue of "hallucination," which refers to instances where llms generate factually inaccurate or ungrounded information. existing techniques for hallucination detection in language assistants rely on intricate fuzzy, specific free-language-based chain of thought (cot) techniques or parameter-based methods that suffer from interpretability issues. additionally, the methods that identify hallucinations post-generation could not prevent their occurrence and suffer from inconsistent performance due to the influence of the instruction format and model style. in this paper, we introduce a novel pre-detection self-evaluation technique, referred to as self-familiarity, which focuses on evaluating the model's familiarity with the concepts present in the input instruction and withholding the generation of response in case of unfamiliar concepts. this approach emulates the human ability to refrain from responding to unfamiliar topics, thus reducing hallucinations. we validate self-familiarity across four different large language models, demonstrating consistently superior performance compared to existing techniques. our findings propose a significant shift towards preemptive strategies for hallucination mitigation in llm assistants, promising improvements in reliability, applicability, and interpretability.

2023-09-04

Raz Lapid, Ron Langberg, Moshe Sipper
Abstract: large language models (llms), designed to provide helpful and safe responses, often rely on alignment techniques to align with user intent and social guidelines. unfortunately, this alignment can be exploited by malicious actors seeking to manipulate an llm's outputs for unintended purposes. in this paper we introduce a novel approach that employs a genetic algorithm (ga) to manipulate llms when model architecture and parameters are inaccessible. the ga attack works by optimizing a universal adversarial prompt that -- when combined with a user's query -- disrupts the attacked model's alignment, resulting in unintended and potentially harmful outputs. our novel approach systematically reveals a model's limitations and vulnerabilities by uncovering instances where its responses deviate from expected behavior. through extensive experiments we demonstrate the efficacy of our technique, thus contributing to the ongoing discussion on responsible ai development by providing a diagnostic tool for evaluating and enhancing alignment of llms with human intent. to our knowledge this is the first automated universal black box jailbreak attack.
Piyush Bajaj, Matthew Edwards
Abstract: automatic scam-baiting is an online fraud countermeasure that involves automated systems responding to online fraudsters in order to waste their time and deplete their resources, diverting attackers away from real potential victims. previous work has demonstrated that text generation systems are capable of engaging with attackers as automatic scam-baiters, but the fluency and coherence of generated text may be a limit to the effectiveness of such systems. in this paper, we report on the results of a month-long experiment comparing the effectiveness of two chatgpt-based automatic scam-baiters to a control measure. within our results, with engagement from over 250 real email fraudsters, we find that chatgpt-based scam-baiters show a marked increase in scammer response rate and conversation length relative to the control measure, outperforming previous approaches. we discuss the implications of these results and practical considerations for wider deployment of automatic scam-baiting.
Max Tegmark, Steve Omohundro
Abstract: we describe a path to humanity safely thriving with powerful artificial general intelligences (agis) by building them to provably satisfy human-specified requirements. we argue that this will soon be technically feasible using advanced ai for formal verification and mechanistic interpretability. we further argue that it is the only path which guarantees safe controlled agi. we end with a list of challenge problems whose solution would contribute to this positive outcome and invite readers to join in this work.

2023-09-03

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi
Abstract: while large language models (llms) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: llms occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. this phenomenon poses a substantial challenge to the reliability of llms in real-world scenarios. in this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by llms. we present taxonomies of the llm hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating llm hallucination, and discuss potential directions for future research.
Dong Huang, Qingwen Bu, Jie Zhang, Xiaofei Xie, Junjie Chen, Heming Cui
Abstract: utilizing state-of-the-art large language models (llms), automatic code generation models play a pivotal role in enhancing the productivity and efficiency of software development coding procedures. as the adoption of llms becomes more widespread in software coding ecosystems, a pressing issue has emerged: does the generated code contain social biases, such as those related to age, gender, and race? this issue concerns the integrity, fairness, and ethical foundation of software applications that depend on the code generated by these models, yet is under-explored in the literature. this paper presents a novel bias assessment framework that is specifically designed for code generation tasks. based on this framework, we conduct an extensive evaluation on the bias of nine state-of-the-art llm-based code generation models. our findings reveal that first, 31.45\% to 79.93\% code functions generated by our evaluated code generation models are biased, and 9.68\% to 37.37\% code functions' functionality are affected by the bias, which means biases not only exist in code generation models but in some cases, directly affect the functionality of the generated code, posing risks of unintended and possibly harmful software behaviors. to mitigate bias from code generation models, we propose three mitigation strategies, which can decrease the biased code ratio to a very low level of 0.4\% to 4.57\%.

2023-09-02

Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Mengnan Du
Abstract: large language models (llms) have demonstrated impressive capabilities in natural language processing. however, their internal mechanisms are still unclear and this lack of transparency poses unwanted risks for downstream applications. therefore, understanding and explaining these models is crucial for elucidating their behaviors, limitations, and social impacts. in this paper, we introduce a taxonomy of explainability techniques and provide a structured overview of methods for explaining transformer-based language models. we categorize techniques based on the training paradigms of llms: traditional fine-tuning-based paradigm and prompting-based paradigm. for each paradigm, we summarize the goals and dominant approaches for generating local explanations of individual predictions and global explanations of overall model knowledge. we also discuss metrics for evaluating generated explanations, and discuss how explanations can be leveraged to debug models and improve performance. lastly, we examine key challenges and emerging opportunities for explanation techniques in the era of llms in comparison to conventional machine learning models.

2023-09-01

Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez
Abstract: transformer based large language models with emergent capabilities are becoming increasingly ubiquitous in society. however, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. this work presents a novel geometric perspective explaining universal adversarial attacks on large language models. by attacking the 117m parameter gpt-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. this hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden representations. we believe this new geometric perspective on the underlying mechanism driving universal attacks could help us gain deeper insight into the internal workings and failure modes of llms, thus enabling their mitigation.
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi
Abstract: reinforcement learning from human feedback (rlhf) is effective at aligning large language models (llms) to human preferences, but gathering high quality human preference labels is a key bottleneck. we conduct a head-to-head comparison of rlhf vs. rl from ai feedback (rlaif) - a technique where preferences are labeled by an off-the-shelf llm in lieu of humans, and we find that they result in similar improvements. on the task of summarization, human evaluators prefer generations from both rlaif and rlhf over a baseline supervised fine-tuned model in ~70% of cases. furthermore, when asked to rate rlaif vs. rlhf summaries, humans prefer both at equal rates. these results suggest that rlaif can yield human-level performance, offering a potential solution to the scalability limitations of rlhf.
Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-Yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein
Abstract: as large language models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: what threat models are practically useful in this domain? how do baseline defense techniques perform in this new domain? how does llm security differ from computer vision? we evaluate several baseline defense strategies against leading adversarial attacks on llms, discussing the various settings in which each is feasible and effective. particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. we discuss white-box and gray-box settings and discuss the robustness-performance trade-off for each of the defenses considered. we find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for llms. future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the llms domain than it has been in computer vision.
Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
Abstract: we aim to better understand the emergence of `situational awareness' in large language models (llms). a model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment. today's llms are tested for safety and alignment before they are deployed. an llm could exploit situational awareness to achieve a high score on safety tests, while taking harmful actions after deployment. situational awareness may emerge unexpectedly as a byproduct of model scaling. one way to better foresee this emergence is to run scaling experiments on abilities necessary for situational awareness. as such an ability, we propose `out-of-context reasoning' (in contrast to in-context learning). we study out-of-context reasoning experimentally. first, we finetune an llm on a description of a test while providing no examples or demonstrations. at test time, we assess whether the model can pass the test. to our surprise, we find that llms succeed on this out-of-context reasoning task. their success is sensitive to the training setup and only works when we apply data augmentation. for both gpt-3 and llama-1, performance improves with model size. these findings offer a foundation for further empirical study, towards predicting and potentially controlling the emergence of situational awareness in llms. code is available at: https://github.com/asacooperstickland/situational-awareness-evals.
Daniel Scalena, Gabriele Sarti, Malvina Nissim, Elisabetta Fersini
Abstract: due to language models' propensity to generate toxic or hateful responses, several techniques were developed to align model generations with users' preferences. despite the effectiveness of such methods in improving the safety of model interactions, their impact on models' internal processes is still poorly understood. in this work, we apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence using feature attribution methods. we evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification, observing differences in prompt reliance between the two methods despite their similar detoxification performances.
Isabel O. Gallegos, Ryan A. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Nesreen K. Ahmed
Abstract: rapid advancements of large language models (llms) have enabled the processing, understanding, and generation of human-like text, with increasing integration into systems that touch our social sphere. despite this success, these models can learn, perpetuate, and amplify harmful social biases. in this paper, we present a comprehensive survey of bias evaluation and mitigation techniques for llms. we first consolidate, formalize, and expand notions of social bias and fairness in natural language processing, defining distinct facets of harm and introducing several desiderata to operationalize fairness for llms. we then unify the literature by proposing three intuitive taxonomies, two for bias evaluation, namely metrics and datasets, and one for mitigation. our first taxonomy of metrics for bias evaluation disambiguates the relationship between metrics and evaluation datasets, and organizes metrics by the different levels at which they operate in a model: embeddings, probabilities, and generated text. our second taxonomy of datasets for bias evaluation categorizes datasets by their structure as counterfactual inputs or prompts, and identifies the targeted harms and social groups; we also release a consolidation of publicly-available datasets for improved access. our third taxonomy of techniques for bias mitigation classifies methods by their intervention during pre-processing, in-training, intra-processing, and post-processing, with granular subcategories that elucidate research trends. finally, we identify open problems and challenges for future work. synthesizing a wide range of recent research, we aim to provide a clear guide of the existing literature that empowers researchers and practitioners to better understand and prevent the propagation of bias in llms.
Taylor Sorensen, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, Maarten Sap, John Tasioulas, Yejin Choi
Abstract: human values are crucial to human decision-making. value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect their feelings, how does one balance honesty with friendship?). as statistical learners, ai systems fit to averages by default, washing out these potentially irreducible value conflicts. to improve ai systems to better reflect value pluralism, the first-order challenge is to explore the extent to which ai systems can model pluralistic human values, rights, and duties as well as their interaction. we introduce valueprism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations. valueprism's contextualized values are generated by gpt-4 and deemed high-quality by human annotators 91% of the time. we conduct a large-scale study with annotators across diverse social and demographic backgrounds to try to understand whose values are represented. with valueprism, we build kaleido, an open, light-weight, and structured language-based multi-task model that generates, explains, and assesses the relevance and valence (i.e., support or oppose) of human values, rights, and duties within a specific context. humans prefer the sets of values output by our system over the teacher gpt-4, finding them more accurate and with broader coverage. in addition, we demonstrate that kaleido can help explain variability in human decision-making by outputting contrasting values. finally, we show that kaleido's representations transfer to other philosophical frameworks and datasets, confirming the benefit of an explicit, modular, and interpretable approach to value pluralism. we hope that our work will serve as a step to making more explicit the implicit values behind human decision-making and to steering ai systems to make decisions that are more in accordance with them.

2023-08-31

Fatma Elsafoury
Abstract: this paper is a summary of the work in my phd thesis. in which, i investigate the impact of bias in nlp models on the task of hate speech detection from three perspectives: explainability, offensive stereotyping bias, and fairness. i discuss the main takeaways from my thesis and how they can benefit the broader nlp community. finally, i discuss important future research directions. the findings of my thesis suggest that bias in nlp models impacts the task of hate speech detection from all three perspectives. and that unless we start incorporating social sciences in studying bias in nlp models, we will not effectively overcome the current limitations of measuring and mitigating bias in nlp models.
Nayeon Lee, Chani Jung, Junho Myung, Jiho Jin, Juho Kim, Alice Oh
Abstract: english datasets predominantly reflect the perspectives of certain nationalities, which can lead to cultural biases in models and datasets. this is particularly problematic in tasks heavily influenced by subjectivity, such as hate speech detection. to delve into how individuals from different countries perceive hate speech, we introduce crehate, a cross-cultural re-annotation of the sampled sbic dataset. this dataset includes annotations from five distinct countries: australia, singapore, south africa, the united kingdom, and the united states. our thorough statistical analysis highlights significant differences based on nationality, with only 59.4% of the samples achieving consensus among all countries. we also introduce a culturally sensitive hate speech classifier via transfer learning, adept at capturing perspectives of different nationalities. these findings underscore the need to re-evaluate certain aspects of nlp research, especially with regard to the nuanced nature of hate speech in the english language.
Muris Sladić, Veronica Valeros, Carlos Catania, Sebastian Garcia
Abstract: honeypots are essential tools in cybersecurity. however, most of them (even the high-interaction ones) lack the required realism to engage and fool human attackers. this limitation makes them easily discernible, hindering their effectiveness. this work introduces a novel method to create dynamic and realistic software honeypots based on large language models. preliminary results indicate that llms can create credible and dynamic honeypots capable of addressing important limitations of previous honeypots, such as deterministic responses, lack of adaptability, etc. we evaluated the realism of each command by conducting an experiment with human attackers who needed to say if the answer from the honeypot was fake or not. our proposed honeypot, called shellm, reached an accuracy rate of 0.92.
Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons
Abstract: are foundation models secure from malicious actors? in this work, we focus on the image input to a vision-language model (vlm). we discover image hijacks, adversarial images that control generative models at runtime. we introduce behaviour matching, a general method for creating image hijacks, and we use it to explore three types of attacks. specific string attacks generate arbitrary output of the adversary's choice. leak context attacks leak information from the context window into the output. jailbreak attacks circumvent a model's safety training. we study these attacks against llava, a state-of-the-art vlm based on clip and llama-2, and find that all our attack types have above a 90% success rate. moreover, our attacks are automated and require only small image perturbations. these findings raise serious concerns about the security of foundation models. if image hijacks are as difficult to defend against as adversarial examples in cifar-10, then it might be many years before a solution is found -- if it even exists.

2023-08-30

Hritik Bansal, John Dang, Aditya Grover
Abstract: aligning large language models (llms) with human values and intents critically involves the use of human or ai feedback. while dense feedback annotations are expensive to acquire and integrate, sparse feedback presents a structural design choice between ratings (e.g., score response a on a scale of 1-7) and rankings (e.g., is response a better than response b?). in this work, we analyze the effect of this design choice for the alignment and evaluation of llms. we uncover an inconsistency problem wherein the preferences inferred from ratings and rankings significantly disagree 60% for both human and ai annotators. our subsequent analysis identifies various facets of annotator biases that explain this phenomena, such as human annotators would rate denser responses higher while preferring accuracy during pairwise judgments. to our surprise, we also observe that the choice of feedback protocol also has a significant effect on the evaluation of aligned llms. in particular, we find that llms that leverage rankings data for alignment (say model x) are preferred over those that leverage ratings data (say model y), with a rank-based evaluation protocol (is x/y's response better than reference response?) but not with a rating-based evaluation protocol (score rank x/y's response on a scale of 1-7). our findings thus shed light on critical gaps in methods for evaluating the real-world utility of language models and their strong dependence on the feedback protocol used for alignment. our code and data are available at https://github.com/hritikbansal/sparse_feedback.
Inyoung Cheong, Aylin Caliskan, Tadayoshi Kohno
Abstract: our interdisciplinary study investigates how effectively u.s. laws confront the challenges posed by generative ai to human values. through an analysis of diverse hypothetical scenarios crafted during an expert workshop, we have identified notable gaps and uncertainties within the existing legal framework regarding the protection of fundamental values, such as privacy, autonomy, dignity, diversity, equity, and physical/mental well-being. constitutional and civil rights, it appears, may not provide sufficient protection against ai-generated discriminatory outputs. furthermore, even if we exclude the liability shield provided by section 230, proving causation for defamation and product liability claims is a challenging endeavor due to the intricate and opaque nature of ai systems. to address the unique and unforeseeable threats posed by generative ai, we advocate for legal frameworks that evolve to recognize new threats and provide proactive, auditable guidelines to industry stakeholders. addressing these issues requires deep interdisciplinary collaborations to identify harms, values, and mitigation strategies.
Jiuhai Chen, Jonas Mueller
Abstract: we introduce bsdetector, a method for detecting bad and speculative answers from a pretrained large language model by estimating a numeric confidence score for any output it generated. our uncertainty quantification technique works for any llm accessible only via a black-box api, whose training data remains unknown. by expending a bit of extra computation, users of any llm api can now get the same response as they would ordinarily, as well as a confidence estimate that cautions when not to trust this response. experiments on both closed and open-form question-answer benchmarks reveal that bsdetector more accurately identifies incorrect llm responses than alternative uncertainty estimation procedures (for both gpt-3 and chatgpt). by sampling multiple responses from the llm and considering the one with the highest confidence score, we can additionally obtain more accurate responses from the same llm, without any extra training steps. in applications involving automated evaluation with llms, accounting for our confidence scores leads to more reliable evaluation in both human-in-the-loop and fully-automated settings (across both gpt 3.5 and 4).
Matija Franklin, Philip Moreira Tomei, Rebecca Gorman
Abstract: the european union's artificial intelligence act aims to regulate manipulative and harmful uses of ai, but lacks precise definitions for key concepts. this paper provides technical recommendations to improve the act's conceptual clarity and enforceability. we review psychological models to define "personality traits," arguing the act should protect full "psychometric profiles." we urge expanding "behavior" to include "preferences" since preferences causally influence and are influenced by behavior. clear definitions are provided for "subliminal," "manipulative," and "deceptive" techniques, considering incentives, intent, and covertness. we distinguish "exploiting individuals" from "exploiting groups," emphasising different policy needs. an "informed decision" is defined by four facets: comprehension, accurate information, no manipulation, and understanding ai's influence. we caution the act's therapeutic use exemption given the lack of regulation of digital therapeutics by the ema. overall, the recommendations strengthen definitions of vague concepts in the eu ai act, enhancing precise applicability to regulate harmful ai manipulation.

2023-08-29

Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, Jitao Sang, Haoyu Tang
Abstract: large vision-language models (lvlms) have recently achieved remarkable success. however, lvlms are still plagued by the hallucination problem, which limits the practicality in many scenarios. hallucination refers to the information of lvlms' responses that does not exist in the visual input, which poses potential risks of substantial consequences. there has been limited work studying hallucination evaluation in lvlms. in this paper, we propose hallucination evaluation based on large language models (haelm), an llm-based hallucination evaluation framework. haelm achieves an approximate 95% performance comparable to chatgpt and has additional advantages including low cost, reproducibility, privacy preservation and local deployment. leveraging the haelm, we evaluate the hallucination in current lvlms. furthermore, we analyze the factors contributing to hallucination in lvlms and offer helpful suggestions to mitigate the hallucination problem. our training data and human annotation hallucination data will be made public soon.
Jingyan Zhou, Minda Hu, Junan Li, Xiaoying Zhang, Xixin Wu, Irwin King, Helen Meng
Abstract: making moral judgments is an essential step toward developing ethical ai systems. prevalent approaches are mostly implemented in a bottom-up manner, which uses a large set of annotated data to train models based on crowd-sourced opinions about morality. these approaches have been criticized for potentially overgeneralizing a limited group of annotators' moral stances and lacking explainability. in contrast, top-down approaches make moral judgments grounded in a set of principles. however, it remains conceptual due to the incapability of previous language models and the unsolved debate among moral principles. in this study, we propose a flexible framework to steer large language models (llms) to perform moral reasoning with well-established moral theories from interdisciplinary research. the theory-guided top-down framework can incorporate various moral theories. our experiments demonstrate the effectiveness of the proposed framework on datasets derived from moral theories. furthermore, we show the alignment between different moral theories and existing morality datasets. our analysis exhibits the potentials and flaws in existing resources (models and datasets) in developing explainable moral judgment-making systems.
Robert Trager, Ben Harack, Anka Reuel, Allison Carnegie, Lennart Heim, Lewis Ho, Sarah Kreps, Ranjit Lall, Owen Larter, Seán Ó Héigeartaigh, Simon Staffell, José Jaime Villalobos
Abstract: this report describes trade-offs in the design of international governance arrangements for civilian artificial intelligence (ai) and presents one approach in detail. this approach represents the extension of a standards, licensing, and liability regime to the global level. we propose that states establish an international ai organization (iaio) to certify state jurisdictions (not firms or ai projects) for compliance with international oversight standards. states can give force to these international standards by adopting regulations prohibiting the import of goods whose supply chains embody ai from non-iaio-certified jurisdictions. this borrows attributes from models of existing international organizations, such as the international civilian aviation organization (icao), the international maritime organization (imo), and the financial action task force (fatf). states can also adopt multilateral controls on the export of ai product inputs, such as specialized hardware, to non-certified jurisdictions. indeed, both the import and export standards could be required for certification. as international actors reach consensus on risks of and minimum standards for advanced ai, a jurisdictional certification regime could mitigate a broad range of potential harms, including threats to public safety.
Zhenhong Zhou, Jiuyang Xiang, Chaomeng Chen, Sen Su
Abstract: large language models (llms) have been proven capable of memorizing their training data, which can be extracted through specifically designed prompts. as the scale of datasets continues to grow, privacy risks arising from memorization have attracted increasing attention. quantifying language model memorization helps evaluate potential privacy risks. however, prior works on quantifying memorization require access to the precise original data or incur substantial computational overhead, making it difficult for applications in real-world language models. to this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. in addition, we also present an approach for efficiently extracting sensitive entities from autoregressive language models. we conduct extensive experiments based on the proposed, probing language models' ability to reconstruct sensitive entities under different settings. we find that language models have strong memorization at the entity level and are able to reproduce the training data even with partial leakages. the results demonstrate that llms not only memorize their training data but also understand associations between entities. these findings necessitate that trainers of llms exercise greater prudence regarding model memorization, adopting memorization mitigation techniques to preclude privacy violations.

2023-08-28

Haomiao Yang, Kunlan Xiang, Mengyu Ge, Hongwei Li, Rongxing Lu, Shui Yu
Abstract: the large language models (llms) are poised to offer efficient and intelligent services for future mobile communication networks, owing to their exceptional capabilities in language comprehension and generation. however, the extremely high data and computational resource requirements for the performance of llms compel developers to resort to outsourcing training or utilizing third-party data and computing resources. these strategies may expose the model within the network to maliciously manipulated training data and processing, providing an opportunity for attackers to embed a hidden backdoor into the model, termed a backdoor attack. backdoor attack in llms refers to embedding a hidden backdoor in llms that causes the model to perform normally on benign samples but exhibit degraded performance on poisoned ones. this issue is particularly concerning within communication networks where reliability and security are paramount. despite the extensive research on backdoor attacks, there remains a lack of in-depth exploration specifically within the context of llms employed in communication networks, and a systematic review of such attacks is currently absent. in this survey, we systematically propose a taxonomy of backdoor attacks in llms as used in communication networks, dividing them into four major categories: input-triggered, prompt-triggered, instruction-triggered, and demonstration-triggered attacks. furthermore, we conduct a comprehensive analysis of the benchmark datasets. finally, we identify potential problems and open challenges, offering valuable insights into future research directions for enhancing the security and integrity of llms in communication networks.
Vahid Ghafouri, Vibhor Agarwal, Yong Zhang, Nishanth Sastry, Jose Such, Guillermo Suarez-Tangil
Abstract: the introduction of chatgpt and the subsequent improvement of large language models (llms) have prompted more and more individuals to turn to the use of chatbots, both for information and assistance with decision-making. however, the information the user is after is often not formulated by these chatbots objectively enough to be provided with a definite, globally accepted answer. controversial topics, such as "religion", "gender identity", "freedom of speech", and "equality", among others, can be a source of conflict as partisan or biased answers can reinforce preconceived notions or promote disinformation. by exposing chatgpt to such debatable questions, we aim to understand its level of awareness and if existing models are subject to socio-political and/or economic biases. we also aim to explore how ai-generated answers compare to human ones. for exploring this, we use a dataset of a social media platform created for the purpose of debating human-generated claims on polemic subjects among users, dubbed kialo. our results show that while previous versions of chatgpt have had important issues with controversial topics, more recent versions of chatgpt (gpt-3.5-turbo) are no longer manifesting significant explicit biases in several knowledge areas. in particular, it is well-moderated regarding economic aspects. however, it still maintains degrees of implicit libertarian leaning toward right-winged ideals which suggest the need for increased moderation from the socio-political point of view. in terms of domain knowledge on controversial topics, with the exception of the "philosophical" category, chatgpt is performing well in keeping up with the collective human level of knowledge. finally, we see that sources of bing ai have slightly more tendency to the center when compared to human answers. all the analyses we make are generalizable to other types of biases and domains.
Fabian Lechner, Allison Lahnala, Charles Welch, Lucie Flek
Abstract: the potential to provide patients with faster information access while allowing medical specialists to concentrate on critical tasks makes medical domain dialog agents appealing. however, the integration of large-language models (llms) into these agents presents certain limitations that may result in serious consequences. this paper investigates the challenges and risks of using gpt-3-based models for medical question-answering (medqa). we perform several evaluations contextualized in terms of standard medical principles. we provide a procedure for manually designing patient queries to stress-test high-risk limitations of llms in medqa systems. our analysis reveals that llms fail to respond adequately to these queries, generating erroneous medical information, unsafe recommendations, and content that may be considered offensive.
Thanh Thi Nguyen, Campbell Wilson, Janis Dalins
Abstract: detecting online sexual predatory behaviours and abusive language on social media platforms has become a critical area of research due to the growing concerns about online safety, especially for vulnerable populations such as children and adolescents. researchers have been exploring various techniques and approaches to develop effective detection systems that can identify and mitigate these risks. recent development of large language models (llms) has opened a new opportunity to address this problem more effectively. this paper proposes an approach to detection of online sexual predatory chats and abusive language using the open-source pretrained llama 2 7b-parameter model, recently released by meta genai. we fine-tune the llm using datasets with different sizes, imbalance degrees, and languages (i.e., english, roman urdu and urdu). based on the power of llms, our approach is generic and automated without a manual search for a synergy between feature extraction and classifier design steps like conventional methods in this domain. experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets with five sets of experiments. this study's outcomes indicate that the proposed method can be implemented in real-world applications (even with non-english languages) for flagging sexual predators, offensive or toxic content, hate speech, and discriminatory language in online discussions and comments to maintain respectful internet or digital communities. furthermore, it can be employed for solving text classification problems with other potential applications such as sentiment analysis, spam and phishing detection, sorting legal documents, fake news detection, language identification, user intent recognition, text-based product categorization, medical record analysis, and resume screening.
Peter S. Park, Simon Goldstein, "Aidan O'Gara", Michael Chen, Dan Hendrycks
Abstract: this paper argues that a range of current ai systems have learned how to deceive humans. we define deception as the systematic inducement of false beliefs in the pursuit of some outcome other than the truth. we first survey empirical examples of ai deception, discussing both special-use ai systems (including meta's cicero) built for specific competitive situations, and general-purpose ai systems (such as large language models). next, we detail several risks from ai deception, such as fraud, election tampering, and losing control of ai systems. finally, we outline several potential solutions to the problems posed by ai deception: first, regulatory frameworks should subject ai systems that are capable of deception to robust risk-assessment requirements; second, policymakers should implement bot-or-not laws; and finally, policymakers should prioritize the funding of relevant research, including tools to detect ai deception and to make ai systems less deceptive. policymakers, researchers, and the broader public should work proactively to prevent ai deception from destabilizing the shared foundations of our society.
Clark Barrett, Brad Boyd, Elie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, Diyi Yang
Abstract: every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. generative ai (genai) techniques, such as large language models (llms) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). however, genai can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. this paper reports the findings of a workshop held at google (co-organized by stanford university and the university of wisconsin-madison) on the dual-use dilemma posed by genai. this paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. we discuss short-term and long-term goals for the community on this topic. we hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.
Hadas Kotek, Rikker Dockum, David Q. Sun
Abstract: large language models (llms) have made substantial progress in the past several months, shattering state-of-the-art benchmarks in many domains. this paper investigates llms' behavior with respect to gender stereotypes, a known issue for prior models. we use a simple paradigm to test the presence of gender bias, building on but differing from winobias, a commonly used gender bias dataset, which is likely to be included in the training data of current llms. we test four recently published llms and demonstrate that they express biased assumptions about men and women's occupations. our contributions in this paper are as follows: (a) llms are 3-6 times more likely to choose an occupation that stereotypically aligns with a person's gender; (b) these choices align with people's perceptions better than with the ground truth as reflected in official job statistics; (c) llms in fact amplify the bias beyond what is reflected in perceptions or the ground truth; (d) llms ignore crucial ambiguities in sentence structure 95% of the time in our study items, but when explicitly prompted, they recognize the ambiguity; (e) llms provide explanations for their choices that are factually inaccurate and likely obscure the true reason behind their predictions. that is, they provide rationalizations of their biased behavior. this highlights a key property of these models: llms are trained on imbalanced datasets; as such, even with the recent successes of reinforcement learning with human feedback, they tend to reflect those imbalances back at us. as with other types of societal biases, we suggest that llms must be carefully tested to ensure that they treat minoritized individuals and communities equitably.

2023-08-27

Gabriel Alon, Michael Kamfonas
Abstract: a novel hack involving large language models (llms) has emerged, leveraging adversarial suffixes to trick models into generating perilous responses. this method has garnered considerable attention from reputable media outlets such as the new york times and wired, thereby influencing public perception regarding the security and safety of llms. in this study, we advocate the utilization of perplexity as one of the means to recognize such potential attacks. the underlying concept behind these hacks revolves around appending an unusually constructed string of text to a harmful query that would otherwise be blocked. this maneuver confuses the protective mechanisms and tricks the model into generating a forbidden response. such scenarios could result in providing detailed instructions to a malicious user for constructing explosives or orchestrating a bank heist. our investigation demonstrates the feasibility of employing perplexity, a prevalent natural language processing metric, to detect these adversarial tactics before generating a forbidden response. by evaluating the perplexity of queries with and without such adversarial suffixes using an open-source llm, we discovered that nearly 90 percent were above a perplexity of 1000. this contrast underscores the efficacy of perplexity for detecting this type of exploit.
Alexander J. Titus, Adam H. Russell
Abstract: artificial intelligence (ai) promises immense benefits across sectors, yet also poses risks from dual-use potentials, biases, and unintended behaviors. this paper reviews emerging issues with opaque and uncontrollable ai systems and proposes an integrative framework called violet teaming to develop reliable and responsible ai. violet teaming combines adversarial vulnerability probing (red teaming) with solutions for safety and security (blue teaming) while prioritizing ethics and social benefit. it emerged from ai safety research to manage risks proactively by design. the paper traces the evolution of red, blue, and purple teaming toward violet teaming, and then discusses applying violet techniques to address biosecurity risks of ai in biotechnology. additional sections review key perspectives across law, ethics, cybersecurity, macrostrategy, and industry best practices essential for operationalizing responsible ai through holistic technical and social considerations. violet teaming provides both philosophy and method for steering ai trajectories toward societal good. with conscience and wisdom, the extraordinary capabilities of ai can enrich humanity. but without adequate precaution, the risks could prove catastrophic. violet teaming aims to empower moral technology for the common welfare.

2023-08-26

"Charles O'Neill", Jack Miller, Ioana Ciuca, Yuan-Sen Ting, Thang Bui
Abstract: in this paper, we tackle the emerging challenge of unintended harmful content generation in large language models (llms) with a novel dual-stage optimisation technique using adversarial fine-tuning. our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts. in this adversarial cycle, the two models seek to outperform each other in the prompting phase, generating a dataset of rich examples which are then used for fine-tuning. this iterative application of prompting and fine-tuning allows continuous refinement and improved performance. the performance of our approach is evaluated through classification accuracy on a dataset consisting of problematic prompts not detected by gpt-4, as well as a selection of contentious but unproblematic prompts. we show considerable increase in classification accuracy of the judge model on this challenging dataset as it undergoes the optimisation process. furthermore, we show that a rudimentary model \texttt{ada} can achieve 13\% higher accuracy on the hold-out test set than gpt-4 after only a few rounds of this process, and that this fine-tuning improves performance in parallel tasks such as toxic comment identification.
Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao, Wenjing Fang, Lei Wang, Zihui Zhang, Wenzhi Chen
Abstract: prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. the state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the backdoor triggers. to address this issue, we propose lmsanitator, a novel approach for detecting and removing task-agnostic backdoors on transformer models. instead of directly inverting the triggers, lmsanitator aims to invert the predefined attack vectors (pretrained models' output when the input is embedded with triggers) of the task-agnostic backdoors, which achieves much better convergence performance and backdoor detection accuracy. lmsanitator further leverages prompt-tuning's property of freezing the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. extensive experiments on multiple language models and nlp tasks illustrate the effectiveness of lmsanitator. for instance, lmsanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.

2023-08-25

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, Timothy Baldwin
Abstract: with the rapid evolution of large language models (llms), new and hard-to-predict harmful capabilities are emerging. this requires developers to be able to identify risks through the evaluation of "dangerous capabilities" in order to responsibly deploy llms. in this work, we collect the first open-source dataset to evaluate safeguards in llms, and deploy safer open-source llms at a low cost. our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. we annotate and assess the responses of six popular llms to these instructions. based on our annotation, we proceed to train several bert-like classifiers, and find that these small classifiers can achieve results that are comparable with gpt-4 on automatic safety evaluation. warning: this paper contains example data that may be offensive, harmful, or biased.
Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki
Abstract: from the perspective of content safety issues, alignment has shown to limit large language models' (llms) harmful content generation. this intentional method of reinforcing models to not respond to certain user inputs seem to be present in many modern open-source instruction tuning datasets such as openassistant or guanaco. we introduce a novel insight to an instruction-tuned model's performance affected by the presence of alignment in supervised fine-tuning dataset. to be specific, we noticed that alignment acts as if it is poisoning the instruction dataset. experimentally, we demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model's on various reasoning benchmarks such as big bench (bbh), massive multitask language understanding (mmlu), human eval, and discrete reasoning over paragraphs (drop), performing worse than the counterpart tuned without alignment by 4-33%.
Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, Miguel Rodrigues
Abstract: the deployment of large language models (llms) raises concerns regarding their cultural misalignment and potential ramifications on individuals from various cultural norms. existing work investigated political and social biases and public opinions rather than their cultural values. to address this limitation, the proposed cultural alignment test (cat) quantifies cultural alignment using hofstede's cultural dimension framework, which offers an explanatory cross-cultural comparison through the latent variable analysis. we apply our approach to assess the cultural values embedded in state-of-the-art llms, such as: chatgpt and bard, across diverse cultures of countries: united states (us), saudi arabia, china, and slovakia, using different prompting styles and hyperparameter settings. our results not only quantify cultural alignment of llms with certain countries, but also reveal the difference between llms in explanatory cultural dimensions. while all llms did not provide satisfactory results in understanding cultural values, gpt-4 exhibited the highest cat score for the cultural values of the us.

2023-08-24

Yachao Zhao, Bo Wang, Dongming Zhao, Kun Huang, Yan Wang, Ruifang He, Yuexian Hou
Abstract: recent researches indicate that pre-trained large language models (llms) possess cognitive constructs similar to those observed in humans, prompting researchers to investigate the cognitive aspects of llms. this paper focuses on explicit and implicit social bias, a distinctive two-level cognitive construct in psychology. it posits that individuals' explicit social bias, which is their conscious expression of bias in the statements, may differ from their implicit social bias, which represents their unconscious bias. we propose a two-stage approach and discover a parallel phenomenon in llms known as "re-judge inconsistency" in social bias. in the initial stage, the llm is tasked with automatically completing statements, potentially incorporating implicit social bias. however, in the subsequent stage, the same llm re-judges the biased statement generated by itself but contradicts it. we propose that this re-judge inconsistency can be similar to the inconsistency between human's unaware implicit social bias and their aware explicit social bias. experimental investigations on chatgpt and gpt-4 concerning common gender biases examined in psychology corroborate the highly stable nature of the re-judge inconsistency. this finding may suggest that diverse cognitive constructs emerge as llms' capabilities strengthen. consequently, leveraging psychological theories can provide enhanced insights into the underlying mechanisms governing the expressions of explicit and implicit constructs in llms.
Maximilian Mozes, Xuanli He, Bennett Kleinberg, Lewis D. Griffin
Abstract: spurred by the recent rapid increase in the development and distribution of large language models (llms) across industry and academia, much recent work has drawn attention to safety- and security-related threats and vulnerabilities of llms, including in the context of potentially criminal activities. specifically, it has been shown that llms can be misused for fraud, impersonation, and the generation of malware; while other authors have considered the more general problem of ai alignment. it is important that developers and practitioners alike are aware of security-related problems with such models. in this paper, we provide an overview of existing - predominantly scientific - efforts on identifying and mitigating threats and vulnerabilities arising from llms. we present a taxonomy describing the relationship between threats caused by the generative capabilities of llms, prevention measures intended to address such threats, and vulnerabilities arising from imperfect prevention measures. with our work, we hope to raise awareness of the limitations of llms in light of such security concerns, among both experienced developers and novel users of such technologies.
Pranav Narayanan Venkit
Abstract: the rapid growth in the usage and applications of natural language processing (nlp) in various sociotechnical solutions has highlighted the need for a comprehensive understanding of bias and its impact on society. while research on bias in nlp has expanded, several challenges persist that require attention. these include the limited focus on sociodemographic biases beyond race and gender, the narrow scope of analysis predominantly centered on models, and the technocentric implementation approaches. this paper addresses these challenges and advocates for a more interdisciplinary approach to understanding bias in nlp. the work is structured into three facets, each exploring a specific aspect of bias in nlp.

2023-08-23

Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang, Xing Xie
Abstract: big models, exemplified by large language models (llms), are models typically pre-trained on massive data and comprised of enormous parameters, which not only obtain significantly improved performance across diverse tasks but also present emergent capabilities absent in smaller models. however, the growing intertwining of big models with everyday human lives poses potential risks and might cause serious social harm. therefore, many efforts have been made to align llms with humans to make them better follow user instructions and satisfy human preferences. nevertheless, `what to align with' has not been fully discussed, and inappropriate alignment goals might even backfire. in this paper, we conduct a comprehensive survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal. particularly, we investigate related works from two perspectives: the definition of alignment goals and alignment evaluation. our analysis encompasses three distinct levels of alignment goals and reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced llms. based on such results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.
Jian Hu, Li Tao, June Yang, Chandler Zhou
Abstract: learning from human preferences is crucial for language models (lms) to effectively cater to human needs and societal values. previous research has made notable progress by leveraging human feedback to follow instructions. however, these approaches rely primarily on online reinforcement learning (rl) techniques like proximal policy optimization (ppo), which have been proven unstable and challenging to tune for language models. moreover, ppo requires complex distributed system implementation, hindering the efficiency of large-scale distributed training. in this study, we propose an offline reinforcement learning from human feedback (rlhf) framework to align lms using pre-generated samples without interacting with rl environments. specifically, we explore maximum likelihood estimation (mle) with filtering, reward-weighted regression (rwr), and decision transformer (dt) to align language models to human preferences. by employing a loss function similar to supervised fine-tuning, our methods ensure more stable model training than ppo with a simple machine learning system~(mlsys) and much fewer (around 12.3\%) computing resources. experimental results demonstrate the dt alignment outperforms other offline rlhf methods and is better than ppo.
Maria Rigaki, Ondřej Lukáš, Carlos A. Catania, Sebastian Garcia
Abstract: large language models (llms) have gained widespread popularity across diverse domains involving text generation, summarization, and various natural language processing tasks. despite their inherent limitations, llm-based designs have shown promising capabilities in planning and navigating open-world scenarios. this paper introduces a novel application of pre-trained llms as agents within cybersecurity network environments, focusing on their utility for sequential decision-making processes. we present an approach wherein pre-trained llms are leveraged as attacking agents in two reinforcement learning environments. our proposed agents demonstrate similar or better performance against state-of-the-art agents trained for thousands of episodes in most scenarios and configurations. in addition, the best llm agents perform similarly to human testers of the environment without any additional training process. this design highlights the potential of llms to efficiently address complex decision-making tasks within cybersecurity. furthermore, we introduce a new network security environment named netsecgame. the environment is designed to eventually support complex multi-agent scenarios within the network security domain. the proposed environment mimics real network attacks and is designed to be highly modular and adaptable for various scenarios.
Madelyne Xiao, Jonathan Mayer
Abstract: we examine the disconnect between scholarship and practice in applying machine learning to trust and safety problems, using misinformation detection as a case study. we systematize literature on automated detection of misinformation across a corpus of 270 well-cited papers in the field. we then examine subsets of papers for data and code availability, design missteps, reproducibility, and generalizability. we find significant shortcomings in the literature that call into question claimed performance and practicality. detection tasks are often meaningfully distinct from the challenges that online services actually face. datasets and model evaluation are often non-representative of real-world contexts, and evaluation frequently is not independent of model training. data and code availability is poor. models do not generalize well to out-of-domain data. based on these results, we offer recommendations for evaluating machine learning applications to trust and safety problems. our aim is for future work to avoid the pitfalls that we identify.
Fredrik Heiding, Bruce Schneier, Arun Vishwanath, Jeremy Bernstein
Abstract: ai programs, built using large language models, make it possible to automatically create phishing emails based on a few data points about a user. they stand in contrast to traditional phishing emails that hackers manually design using general rules gleaned from experience. the v-triad is an advanced set of rules for manually designing phishing emails to exploit our cognitive heuristics and biases. in this study, we compare the performance of phishing emails created automatically by gpt-4 and manually using the v-triad. we also combine gpt-4 with the v-triad to assess their combined potential. a fourth group, exposed to generic phishing emails, was our control group. we utilized a factorial approach, sending emails to 112 randomly selected participants recruited for the study. the control group emails received a click-through rate between 19-28%, the gpt-generated emails 30-44%, emails generated by the v-triad 69-79%, and emails generated by gpt and the v-triad 43-81%. each participant was asked to explain for why they pressed or did not press a link in the email. these answers often contradict each other, highlighting the need for personalized content. the cues that make one person avoid phishing emails make another person fall for them. next, we used four popular large language models (gpt, claude, palm, and llama) to detect the intention of phishing emails and compare the results to human detection. the language models demonstrated a strong ability to detect malicious intent, even in non-obvious phishing emails. they sometimes surpassed human detection, although often being slightly less accurate than humans.
Vipul Gupta, Pranav Narayanan Venkit, Hugo Laurençon, Shomir Wilson, Rebecca J. Passonneau
Abstract: as language models (lms) become increasingly powerful, it is important to quantify and compare them for sociodemographic bias with potential for harm. prior bias measurement datasets are sensitive to perturbations in their manually designed templates, therefore unreliable. to achieve reliability, we introduce the comprehensive assessment of language model bias (calm), a benchmark dataset to quantify bias in lms across three tasks. we integrate 16 existing datasets across different domains, such as wikipedia and news articles, to filter 224 templates from which we construct a dataset of 78,400 examples. we compare the diversity of calm with prior datasets on metrics such as average semantic similarity, and variation in template length, and test the sensitivity to small perturbations. we show that our dataset is more diverse and reliable than previous datasets, thus better capture the breadth of linguistic variation required to reliably evaluate model bias. we evaluate 20 large language models including six prominent families of lms such as llama-2. in two lm series, opt and bloom, we found that larger parameter models are more biased than lower parameter models. we found the t0 series of models to be the least biased. furthermore, we noticed a tradeoff between gender and racial bias with increasing model size in some model series. the code is available at https://github.com/vipulgupta1011/calm.

2023-08-22

Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, Shizhu Liu, Pingchuan Tian, Yuping Wang, Yuxuan Wang
Abstract: large language models (llms) have revolutionized natural language processing (nlp). although convenient for research and practical applications, open-source llms with fewer parameters often suffer from severe hallucinations compared to their larger counterparts. this paper focuses on measuring and reducing hallucinations in bloom 7b, a representative of such weaker open-source llms that are publicly available for research and commercial applications. we introduce halocheck, a lightweight blackbox knowledge-free framework designed to quantify the severity of hallucinations in llms. additionally, we explore techniques like knowledge injection and teacher-student approaches to alleviate hallucinations in low-parameter llms. our experiments effectively demonstrate the reduction of hallucinations in challenging domains for these llms.

2023-08-21

Fatma Elsafoury
Abstract: research has shown that language models (lms) are socially biased. however, toxicity and offensive stereotyping bias in lms are understudied. in this paper, we investigate the systematic offensive stereotype (sos) bias in lms. we propose a method to measure it. then, we validate the sos bias and investigate the effectiveness of debias methods from the literature on removing it. finally, we investigate the impact of the sos bias in lms on their performance and their fairness on the task of hate speech detection. our results suggest that all the inspected lms are sos biased. the results suggest that the sos bias in lms is reflective of the hate experienced online by the inspected marginalized groups. the results indicate that removing the sos bias in lms, using a popular debias method from the literature, leads to worse sos bias scores. finally, our results show no strong evidence that the sos bias in lms is impactful on their performance on hate speech detection. on the other hand, there is evidence that the sos bias in lms is impactful on their fairness.
Christian Schlarmann, Matthias Hein
Abstract: multi-modal foundation models combining vision and language models such as flamingo or gpt-4 have recently gained enormous interest. alignment of foundation models is used to prevent models from providing toxic or harmful output. while malicious users have successfully tried to jailbreak foundation models, an equally important question is if honest users could be harmed by malicious third-party content. in this paper we show that imperceivable attacks on images in order to change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users e.g. by guiding them to malicious websites or broadcast fake information. this indicates that countermeasures to adversarial attacks should be used by any deployed multi-modal foundation model.
Matthew R. Deverna, Harry Yaojun Yan, Kai-Cheng Yang, Filippo Menczer
Abstract: fact checking can be an effective strategy against misinformation, but its implementation at scale is impeded by the overwhelming volume of information online. recent artificial intelligence (ai) language models have shown impressive ability in fact-checking tasks, but how humans interact with fact-checking information provided by these models is unclear. here we investigate the impact of fact checks generated by a popular ai model on belief in, and sharing intent of, political news in a preregistered randomized control experiment. although the ai performs reasonably well in debunking false headlines, we find that it does not significantly affect participants' ability to discern headline accuracy or share accurate news. however, the ai fact-checker is harmful in specific cases: it decreases beliefs in true headlines that it mislabels as false and increases beliefs for false headlines that it is unsure about. on the positive side, the ai increases sharing intents for correctly labeled true headlines. when participants are given the option to view ai fact checks and choose to do so, they are significantly more likely to share both true and false news but only more likely to believe false news. our findings highlight an important source of potential harm stemming from ai applications and underscore the critical need for policies to prevent or mitigate such unintended consequences.
Alex Nyffenegger, Matthias Stürmer, Joel Niklaus
Abstract: anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the european union and switzerland. with the advent of llms, concerns about large-scale re-identification of anonymized persons are growing. in accordance with the federal supreme court of switzerland, we explore the potential of llms to re-identify individuals in court rulings by constructing a proof-of-concept using actual legal data from the swiss federal supreme court. following the initial experiment, we constructed an anonymized wikipedia dataset as a more rigorous testing ground to further investigate the findings. with the introduction and application of the new task of re-identifying people in texts, we also introduce new metrics to measure performance. we systematically analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants. despite high re-identification rates on wikipedia, even the best llms struggled with court decisions. the complexity is attributed to the lack of test datasets, the necessity for substantial training resources, and data sparsity in the information used for re-identification. in conclusion, this study demonstrates that re-identification using llms may not be feasible for now, but as the proof-of-concept on wikipedia showed, it might become possible in the future. we hope that our system can help enhance the confidence in the security of anonymized decisions, thus leading to the courts being more confident to publish decisions.

2023-08-20

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte Macdiarmid
Abstract: reliably controlling the behavior of large language models is a pressing open problem. existing methods include supervised finetuning, reinforcement learning from human feedback, prompt engineering, and guided decoding. we instead investigate activation engineering: modifying activations at inference time to predictably alter model behavior. in particular, we bias the forward pass with an added 'steering vector' implicitly specified through natural language. unlike past work which learned these steering vectors, our activation addition (actadd) method computes them by taking the activation differences that result from pairs of prompts. we demonstrate actadd on gpt-2 on openwebtext and conceptnet. our inference-time approach yields control over high-level properties of output and preserves off-target model performance. it involves far less compute and implementation effort than finetuning, allows users to provide natural language specifications, and its overhead scales naturally with model size.
Li Zhong, Zilong Wang
Abstract: recently, the large language models (llms) have shown extraordinary ability in understanding natural language and generating programming code. it has been a common practice of software engineers to consult llms when encountering coding questions. although efforts have been made to avoid syntax errors and align the code with the intended semantics, the reliability and robustness of the code generationfrom llms have not yet been thoroughly studied. the executable code is not equivalent to the reliable and robust code, especially in the context of real-world software development. the misuse of apis in the generated code could lead to severe problem, such as resource leaks, program crashes. to make things worse, the users of llm code generation services are actually the developers that are most vulnerable to these code that seems right -- they are always novice developers that are not familiar with the apis that llms generate code for them. therefore, they could hardly tell the misuse in the code generated by llms, which further facilitates the incorrect code applied in real-world software. existing code evaluation benchmark and datasets focus on crafting small tasks such as programming questions in coding interviews, which however deviates from the problem that developers would ask llm for real-world coding help. to fill the missing piece, in this work, we propose a dataset robustapi for evaluating the reliability and robustness of code generated by llms. we collect 1208 coding questions from stackoverflow on 24 representative java apis. we summarize thecommon misuse patterns of these apis and evaluate them oncurrent popular llms. the evaluation results show that evenfor gpt-4, 62% of the generated code contains api misuses,which would cause unexpected consequences if the code isintroduced into real-world software.
David Noever
Abstract: in this study, we evaluated the capability of large language models (llms), particularly openai's gpt-4, in detecting software vulnerabilities, comparing their performance against traditional static code analyzers like snyk and fortify. our analysis covered numerous repositories, including those from nasa and the department of defense. gpt-4 identified approximately four times the vulnerabilities than its counterparts. furthermore, it provided viable fixes for each vulnerability, demonstrating a low rate of false positives. our tests encompassed 129 code samples across eight programming languages, revealing the highest vulnerabilities in php and javascript. gpt-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines. a critical insight was llms' ability to self-audit, suggesting fixes for their identified vulnerabilities and underscoring their precision. future research should explore system-level vulnerabilities and integrate multiple static code analyzers for a holistic perspective on llms' potential.
Yanhong Bai, Jiabao Zhao, Jinxin Shi, Tingjiang Wei, Xingjiao Wu, Liang He
Abstract: detecting stereotypes and biases in large language models (llms) can enhance fairness and reduce adverse impacts on individuals or groups when these llms are applied. however, the majority of existing methods focus on measuring the model's preference towards sentences containing biases and stereotypes within datasets, which lacks interpretability and cannot detect implicit biases and stereotypes in the real world. to address this gap, this paper introduces a four-stage framework to directly evaluate stereotypes and biases in the generated content of llms, including direct inquiry testing, serial or adapted story testing, implicit association testing, and unknown situation testing. additionally, the paper proposes multi-dimensional evaluation metrics and explainable zero-shot prompts for automated evaluation. using the education sector as a case study, we constructed the edu-fairbench based on the four-stage framework, which encompasses 12,632 open-ended questions covering nine sensitive factors and 26 educational scenarios. experimental results reveal varying degrees of stereotypes and biases in five llms evaluated on edu-fairbench. moreover, the results of our proposed automated evaluation method have shown a high correlation with human annotations.
Wesley Tann, Yuancheng Liu, Jun Heng Sim, Choon Meng Seah, Ee-Chien Chang
Abstract: the assessment of cybersecurity capture-the-flag (ctf) exercises involves participants finding text strings or ``flags'' by exploiting system vulnerabilities. large language models (llms) are natural-language models trained on vast amounts of words to understand and generate text; they can perform well on many ctf challenges. such llms are freely available to students. in the context of ctf exercises in the classroom, this raises concerns about academic integrity. educators must understand llms' capabilities to modify their teaching to accommodate generative ai assistance. this research investigates the effectiveness of llms, particularly in the realm of ctf challenges and questions. here we evaluate three popular llms, openai chatgpt, google bard, and microsoft bing. first, we assess the llms' question-answering performance on five cisco certifications with varying difficulty levels. next, we qualitatively study the llms' abilities in solving ctf challenges to understand their limitations. we report on the experience of using the llms for seven test cases in all five types of ctf challenges. in addition, we demonstrate how jailbreak prompts can bypass and break llms' ethical safeguards. the paper concludes by discussing llm's impact on ctf exercises and its implications.

2023-08-19

Yihong Dong, Kangcheng Luo, Xue Jiang, Zhi Jin, Ge Li
Abstract: large language models (llms) have showcased remarkable potential across various tasks by conditioning on prompts. however, the quality of different human-written prompts leads to substantial discrepancies in llms' performance, and improving prompts usually necessitates considerable human effort and expertise. to this end, this paper proposes prompt with actor-critic editing (pace) for llms to enable automatic prompt editing. drawing inspiration from the actor-critic algorithm in reinforcement learning, pace leverages llms as the dual roles of actors and critics, conceptualizing prompt as a type of policy. pace refines prompt, taking into account the feedback from both actors performing prompt and critics criticizing response. this process helps llms better align prompt to a specific task, thanks to real responses and thinking from llms. we conduct extensive experiments on 24 instruction induction tasks and 21 big-bench tasks. experimental results indicate that pace elevates the relative performance of medium/low-quality human-written prompts by up to 98\%, which has comparable performance to high-quality human-written prompts. moreover, pace also exhibits notable efficacy for prompt generation.
Yingji Li, Mengnan Du, Rui Song, Xin Wang, Ying Wang
Abstract: large language models (llms) have shown powerful performance and development prospect and are widely deployed in the real world. however, llms can capture social biases from unprocessed training data and propagate the biases to downstream tasks. unfair llm systems have undesirable social impacts and potential harms. in this paper, we provide a comprehensive review of related research on fairness in llms. first, for medium-scale llms, we introduce evaluation metrics and debiasing methods from the perspectives of intrinsic bias and extrinsic bias, respectively. then, for large-scale llms, we introduce recent fairness research, including fairness evaluation, reasons for bias, and debiasing methods. finally, we discuss and provide insight on the challenges and future directions for the development of fairness in llms.

2023-08-18

Rishabh Bhardwaj, Soujanya Poria
Abstract: larger language models (llms) have taken the world by storm with their massive multi-tasking capabilities simply by optimizing over a next-word prediction objective. with the emergence of their properties and encoded knowledge, the risk of llms producing harmful outputs increases, making them unfit for scalable deployment for the public. in this work, we propose a new safety evaluation benchmark red-eval that carries out red-teaming. we show that even widely deployed models are susceptible to the chain of utterances-based (cou) prompting, jailbreaking closed source llm-based systems such as gpt-4 and chatgpt to unethically respond to more than 65% and 73% of harmful queries. we also demonstrate the consistency of the red-eval across 8 open-source llms in generating harmful responses in more than 86% of the red-teaming attempts. next, we propose red-instruct--an approach for the safety alignment of llms. it constitutes two phases: 1) harmfulqa data collection: leveraging cou prompting, we collect a dataset that consists of 1.9k harmful questions covering a wide range of topics, 9.5k safe and 7.3k harmful conversations from chatgpt; 2) safe-align: we demonstrate how the conversational dataset can be used for the safety alignment of llms by minimizing the negative log-likelihood over helpful responses and penalizing over harmful responses by gradient accent over sample loss. our model starling, a fine-tuned vicuna-7b, is observed to be more safely aligned when evaluated on red-eval and hhh benchmarks while preserving the utility of the baseline models (truthfulqa, mmlu, and bbh).

2023-08-17

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando De Freitas
Abstract: reinforcement learning from human feedback (rlhf) can improve the quality of large language model's (llm) outputs by aligning them with human preferences. we propose a simple algorithm for aligning llms with human preferences inspired by growing batch reinforcement learning (rl), which we call reinforced self-training (rest). given an initial llm policy, rest produces a dataset by generating samples from the policy, which are then used to improve the llm policy using offline rl algorithms. rest is more efficient than typical online rlhf methods because the training dataset is produced offline, which allows data reuse. while rest is a general approach applicable to all generative learning settings, we focus on its application to machine translation. our results show that rest can substantially improve translation quality, as measured by automated metrics and human evaluation on machine translation benchmarks in a compute and sample-efficient manner.
Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar
Abstract: large language models (llms) exhibit remarkable fluency and competence across various natural language tasks. however, recent research has highlighted their sensitivity to variations in input prompts. to deploy llms in a safe and reliable manner, it is crucial for their outputs to be consistent when prompted with expressions that carry the same meaning or intent. while some existing work has explored how state-of-the-art llms address this issue, their evaluations have been confined to assessing lexical equality of single- or multi-word answers, overlooking the consistency of generative text sequences. for a more comprehensive understanding of the consistency of llms in open-ended text generation scenarios, we introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various llms. our proposal demonstrates significantly higher consistency and stronger correlation with human evaluations of output consistency than traditional metrics based on lexical consistency. finally, we propose a novel prompting strategy, called ask-to-choose (a2c), to enhance semantic consistency. when evaluated for closed-book question answering based on answer variations from the truthfulqa benchmark, a2c increases accuracy metrics for pretrained and finetuned llms by up to 47%, and semantic consistency metrics for instruction-tuned models by up to 7-fold.
Mika Beckerich, Laura Plein, Sergio Coronado
Abstract: the evolution of generative ai and the capabilities of the newly released large language models (llms) open new opportunities in software engineering. however, they also lead to new challenges in cybersecurity. recently, researchers have shown the possibilities of using llms such as chatgpt to generate malicious content that can directly be exploited or guide inexperienced hackers to weaponize tools and code. these studies covered scenarios that still require the attacker to be in the middle of the loop. in this study, we leverage openly available plugins and use an llm as proxy between the attacker and the victim. we deliver a proof-of-concept where chatgpt is used for the dissemination of malicious software while evading detection, alongside establishing the communication to a command and control (c2) server to receive commands to interact with a victim's system. finally, we present the general approach as well as essential elements in order to stay undetected and make the attack a success. this proof-of-concept highlights significant cybersecurity issues with openly available plugins and llms, which require the development of security guidelines, controls, and mitigation strategies.
Zekun Li, Baolin Peng, Pengcheng He, Xifeng Yan
Abstract: large language models (llms) have shown remarkable proficiency in following instructions, making them valuable in customer-facing applications. however, their impressive capabilities also raise concerns about the amplification of risks posed by adversarial instructions, which can be injected into the model input by third-party attackers to manipulate llms' original instructions and prompt unintended actions and content. therefore, it is crucial to understand llms' ability to accurately discern which instructions to follow to ensure their safe deployment in real-world scenarios. in this paper, we propose a pioneering benchmark for automatically evaluating the robustness of instruction-following llms against adversarial instructions injected in the prompt. the objective of this benchmark is to quantify the extent to which llms are influenced by injected adversarial instructions and assess their ability to differentiate between these injected adversarial instructions and original user instructions. through experiments conducted with state-of-the-art instruction-following llms, we uncover significant limitations in their robustness against adversarial instruction injection attacks. furthermore, our findings indicate that prevalent instruction-tuned models are prone to being ``overfitted'' to follow any instruction phrase in the prompt without truly understanding which instructions should be followed. this highlights the need to address the challenge of training models to comprehend prompts instead of merely following instruction phrases and completing the text. the data and code can be found at \url{https://github.com/leezekun/adv-instruct-eval}.
Myke Healy
Abstract: in the 2023-2024 academic year, the widespread availability of generative artificial intelligence, exemplified by chatgpt's 1.6 billion monthly visits, is set to impact academic integrity. with 77% of high school students previously reporting engagement in dishonest behaviour, the rise of ai-driven writing assistance, dubbed 'ai-giarism' by chan (arxiv:2306.03358v2), will make plagiarism more accessible and less detectable. while these concerns are urgent, they also raise broader questions about the revolutionary nature of this technology, including autonomy, data privacy, copyright, and equity. this paper aims to explore generative ai from a social justice perspective, examining the training of these models, the inherent biases, and the potential injustices in detecting ai-generated writing.

2023-08-16

Zecheng Tang, Keyan Zhou, Pinzheng Wang, Yuyang Ding, Juntao Li, N/A Minzhang
Abstract: detoxification for llms is challenging since it requires models to avoid generating harmful content while maintaining the generation capability. to ensure the safety of generations, previous detoxification methods detoxify the models by changing the data distributions or constraining the generations from different aspects in a single-step manner. however, these approaches will dramatically affect the generation quality of llms, e.g., discourse coherence and semantic consistency, since language models tend to generate along the toxic prompt while detoxification methods work in the opposite direction. to handle such a conflict, we decompose the detoxification process into different sub-steps, where the detoxification is concentrated in the input stage and the subsequent continual generation is based on the non-toxic prompt. besides, we also calibrate the strong reasoning ability of llms by designing a detox-chain to connect the above sub-steps in an orderly manner, which allows llms to detoxify the text step-by-step. automatic and human evaluation on two benchmarks reveals that by training with detox-chain, six llms scaling from 1b to 33b can obtain significant detoxification and generation improvement. our code and data are available at https://github.com/codinnlg/detox-cot. warning: examples in the paper may contain uncensored offensive content.
Zhenhua Wang, Wei Xie, Kai Chen, Baosheng Wang, Zhiwen Gui, Enze Wang
Abstract: large language models (llms), such as chatgpt, have emerged with astonishing capabilities approaching artificial general intelligence. while providing convenience for various societal needs, llms have also lowered the cost of generating harmful content. consequently, llm developers have deployed semantic-level defenses to recognize and reject prompts that may lead to inappropriate content. unfortunately, these defenses are not foolproof, and some attackers have crafted "jailbreak" prompts that temporarily hypnotize the llm into forgetting content defense rules and answering any improper questions. to date, there is no clear explanation of the principles behind these semantic-level attacks and defenses in both industry and academia. this paper investigates the llm jailbreak problem and proposes an automatic jailbreak method for the first time. we propose the concept of a semantic firewall and provide three technical implementation approaches. inspired by the attack that penetrates traditional firewalls through reverse tunnels, we introduce a "self-deception" attack that can bypass the semantic firewall by inducing llm to generate prompts that facilitate jailbreak. we generated a total of 2,520 attack payloads in six languages (english, russian, french, spanish, chinese, and arabic) across seven virtual scenarios, targeting the three most common types of violations: violence, hate, and pornography. the experiment was conducted on two models, namely the gpt-3.5-turbo and gpt-4. the success rates on the two models were 86.2% and 67%, while the failure rates were 4.7% and 2.2%, respectively. this highlighted the effectiveness of the proposed attack method. all experimental code and raw data will be released as open-source to inspire future research. we believe that manipulating ai behavior through carefully crafted prompts will become an important research direction in the future.

2023-08-15

Yugeng Liu, Tianshuo Cong, Zhengyu Zhao, Michael Backes, Yun Shen, Yang Zhang
Abstract: large language models (llms) have led to significant improvements in many tasks across various domains, such as code interpretation, response generation, and ambiguity handling. these llms, however, when upgrading, primarily prioritize enhancing user experience while neglecting security, privacy, and safety implications. consequently, unintended vulnerabilities or biases can be introduced. previous studies have predominantly focused on specific versions of the models and disregard the potential emergence of new attack vectors targeting the updated versions. through the lens of adversarial examples within the in-context learning framework, this longitudinal study addresses this gap by conducting a comprehensive assessment of the robustness of successive versions of llms, vis-\`a-vis gpt-3.5. we conduct extensive experiments to analyze and understand the impact of the robustness in two distinct learning categories: zero-shot learning and few-shot learning. our findings indicate that, in comparison to earlier versions of llms, the updated versions do not exhibit the anticipated level of robustness against adversarial attacks. in addition, our study emphasizes the increased effectiveness of synergized adversarial queries in most zero-shot learning and few-shot learning cases. we hope that our study can lead to a more refined assessment of the robustness of llms over time and provide valuable insights of these models for both developers and users.
Ziyu Zhuang, Qiguang Chen, Longxuan Ma, Mingda Li, Yi Han, Yushan Qian, Haopeng Bai, Zixian Feng, Weinan Zhang, Ting Liu
Abstract: from pre-trained language model (plm) to large language model (llm), the field of natural language processing (nlp) has witnessed steep performance gains and wide practical uses. the evaluation of a research field guides its direction of improvement. however, llms are extremely hard to thoroughly evaluate for two reasons. first of all, traditional nlp tasks become inadequate due to the excellent performance of llm. secondly, existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios. to tackle these problems, existing works proposed various benchmarks to better evaluate llms. to clarify the numerous evaluation tasks in both academia and industry, we investigate multiple papers concerning llm evaluations. we summarize 4 core competencies of llm, including reasoning, knowledge, reliability, and safety. for every competency, we introduce its definition, corresponding benchmarks, and metrics. under this competency architecture, similar tasks are combined to reflect corresponding ability, while new tasks can also be easily added into the system. finally, we give our suggestions on the future direction of llm's evaluation.
Rui Cao, Ming Shan Hee, Adriel Kuek, Wen-Haw Chong, Roy Ka-Wei Lee, Jing Jiang
Abstract: hateful meme detection is a challenging multimodal task that requires comprehension of both vision and language, as well as cross-modal interactions. recent studies have tried to fine-tune pre-trained vision-language models (pvlms) for this task. however, with increasing model sizes, it becomes important to leverage powerful pvlms more efficiently, rather than simply fine-tuning them. recently, researchers have attempted to convert meme images into textual captions and prompt language models for predictions. this approach has shown good performance but suffers from non-informative image captions. considering the two factors mentioned above, we propose a probing-based captioning approach to leverage pvlms in a zero-shot visual question answering (vqa) manner. specifically, we prompt a frozen pvlm by asking hateful content-related questions and use the answers as image captions (which we call pro-cap), so that the captions contain information critical for hateful content detection. the good performance of models with pro-cap on three benchmarks validates the effectiveness and generalization of the proposed method.
Xinshuo Hu, Dongfang Li, Zihao Zheng, Zhenyu Liu, Baotian Hu, Min Zhang
Abstract: large language models (llms) have been widely used in various applications but are known to suffer from issues related to untruthfulness and toxicity. while parameter-efficient modules (pems) have demonstrated their effectiveness in equipping models with new skills, leveraging pems for deficiency unlearning remains underexplored. in this work, we propose a pems operation approach, namely extraction-before-subtraction (ext-sub), to enhance the truthfulness and detoxification of llms through the integration of ``expert'' pem and ``anti-expert'' pem. remarkably, even anti-expert pem possess valuable capabilities due to their proficiency in generating fabricated content, which necessitates language modeling and logical narrative competence. rather than merely negating the parameters, our approach involves extracting and eliminating solely the deficiency capability within anti-expert pem while preserving the general capabilities. to evaluate the effectiveness of our approach in terms of truthfulness and detoxification, we conduct extensive experiments on llms, encompassing additional abilities such as language modeling and mathematical reasoning. our empirical results demonstrate that our approach effectively improves truthfulness and detoxification, while largely preserving the fundamental abilities of llms.
Ahmed Abdeen Hamed, Xindong Wu
Abstract: chatgpt is becoming a new reality. in this paper, we show how to distinguish chatgpt-generated publications from counterparts produced by scientists. using a newly designed supervised machine learning algorithm, we demonstrate how to detect machine-generated publications from those produced by scientists. the algorithm was trained using 100 real publication abstracts, followed by a 10-fold calibration approach to establish a lower-upper bound range of acceptance. in the comparison with chatgpt content, it was evident that chatgpt contributed merely 23\% of the bigram content, which is less than 50\% of any of the other 10 calibrating folds. this analysis highlights a significant disparity in technical terms where chatgpt fell short of matching real science. when categorizing the individual articles, the xfakebibs algorithm accurately identified 98 out of 100 publications as fake, with 2 articles incorrectly classified as real publications. though this work introduced an algorithmic approach that detected the chatgpt-generated fake science with a high degree of accuracy, it remains challenging to detect all fake records. this work is indeed a step in the right direction to counter fake science and misinformation.

2023-08-14

Tharindu Kumarage, Huan Liu
Abstract: large language models (llms) such as gpt-4, palm, and llama have significantly propelled the generation of ai-crafted text. with rising concerns about their potential misuse, there is a pressing need for ai-generated-text forensics. neural authorship attribution is a forensic effort, seeking to trace ai-generated text back to its originating llm. the llm landscape can be divided into two primary categories: proprietary and open-source. in this work, we delve into these emerging categories of llms, focusing on the nuances of neural authorship attribution. to enrich our understanding, we carry out an empirical analysis of llm writing signatures, highlighting the contrasts between proprietary and open-source models, and scrutinizing variations within each group. by integrating stylometric features across lexical, syntactic, and structural aspects of language, we explore their potential to yield interpretable results and augment pre-trained language model-based classifiers utilized in neural authorship attribution. our findings, based on a range of state-of-the-art llms, provide empirical insights into neural authorship attribution, paving the way for future investigations aimed at mitigating the threats posed by ai-generated misinformation.
Mansi Phute, Alec Helbling, Matthew Hull, Shengyun Peng, Sebastian Szyller, Cory Cornelius, Duen Horng Chau
Abstract: large language models (llms) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. adversarial prompts can bypass their safety measures. we propose llm self defense, a simple approach to defend against these attacks by having an llm screen the induced responses. our method does not require any fine-tuning, input preprocessing, or iterative output generation. instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an llm to analyze the text and predict whether it is harmful. we test llm self defense on gpt 3.5 and llama 2, two of the current most prominent llms against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. notably, llm self defense succeeds in reducing the attack success rate to virtually 0 using both gpt 3.5 and llama 2.

2023-08-13

Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, Stefan Rass
Abstract: penetration testing, a crucial industrial practice for ensuring system security, has traditionally resisted automation due to the extensive expertise required by human professionals. large language models (llms) have shown significant advancements in various domains, and their emergent abilities suggest their potential to revolutionize industries. in this research, we evaluate the performance of llms on real-world penetration testing tasks using a robust benchmark created from test machines with platforms. our findings reveal that while llms demonstrate proficiency in specific sub-tasks within the penetration testing process, such as using testing tools, interpreting outputs, and proposing subsequent actions, they also encounter difficulties maintaining an integrated understanding of the overall testing scenario. in response to these insights, we introduce pentestgpt, an llm-empowered automatic penetration testing tool that leverages the abundant domain knowledge inherent in llms. pentestgpt is meticulously designed with three self-interacting modules, each addressing individual sub-tasks of penetration testing, to mitigate the challenges related to context loss. our evaluation shows that pentestgpt not only outperforms llms with a task-completion increase of 228.6\% compared to the \gptthree model among the benchmark targets but also proves effective in tackling real-world penetration testing challenges. having been open-sourced on github, pentestgpt has garnered over 4,700 stars and fostered active community engagement, attesting to its value and impact in both the academic and industrial spheres.
Ahtsham Zafar, Venkatesh Balavadhani Parthasarathy, Chan Le Van, Saad Shahid, Aafaq Iqbal Khan, Arsalan Shahid
Abstract: conversational ai systems have emerged as key enablers of human-like interactions across diverse sectors. nevertheless, the balance between linguistic nuance and factual accuracy has proven elusive. in this paper, we first introduce llmxplorer, a comprehensive tool that provides an in-depth review of over 150 large language models (llms), elucidating their myriad implications ranging from social and ethical to regulatory, as well as their applicability across industries. building on this foundation, we propose a novel functional architecture that seamlessly integrates the structured dynamics of knowledge graphs with the linguistic capabilities of llms. validated using real-world ai news data, our architecture adeptly blends linguistic sophistication with factual rigour and further strengthens data security through role-based access control. this research provides insights into the evolving landscape of conversational ai, emphasizing the imperative for systems that are efficient, transparent, and trustworthy.

2023-08-12

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-Tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
Abstract: safety lies at the core of the development of large language models (llms). there is ample work on aligning llms with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. in this study, we discover that chat in cipher can bypass the safety alignment techniques of llms, which are mainly conducted in natural languages. we propose a novel framework cipherchat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. cipherchat enables humans to chat with llms through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. we use cipherchat to assess state-of-the-art llms, including chatgpt and gpt-4 for different representative human ciphers across 11 safety domains in both english and chinese. experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of gpt-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. notably, we identify that llms seem to have a ''secret cipher'', and propose a novel selfcipher that uses only role play and several demonstrations in natural language to evoke this capability. selfcipher surprisingly outperforms existing human ciphers in almost all cases. our code and data will be released at https://github.com/robustnlp/cipherchat.

2023-08-11

Yue Liu, Qinghua Lu, Liming Zhu, Hye-Young Paik
Abstract: foundation models including large language models (llms) are increasingly attracting interest worldwide for their distinguished capabilities and potential to perform a wide variety of tasks. nevertheless, people are concerned about whether foundation model based ai systems are properly governed to ensure trustworthiness of foundation model based ai systems and to prevent misuse that could harm humans, society and the environment. in this paper, we identify eight governance challenges of foundation model based ai systems regarding the three fundamental dimensions of governance: decision rights, incentives, and accountability. furthermore, we explore the potential of blockchain as a solution to address the challenges by providing a distributed ledger to facilitate decentralised governance. we present an architecture that demonstrates how blockchain can be leveraged to realise governance in foundation model based ai systems.
Debodeep Banerjee, Stefano Teso, Andrea Passerini
Abstract: in learning to defer, a predictor identifies risky decisions and defers them to a human expert. one key issue with this setup is that the expert may end up over-relying on the machine's decisions, due to anchoring bias. at the same time, whenever the machine chooses the deferral option the expert has to take decisions entirely unassisted. as a remedy, we propose learning to guide (ltg), an alternative framework in which -- rather than suggesting ready-made decisions -- the machine provides guidance useful to guide decision-making, and the human is entirely responsible for coming up with a decision. we also introduce slog, an ltg implementation that leverages (a small amount of) human supervision to convert a generic large language model into a module capable of generating textual guidance, and present preliminary but promising results on a medical diagnosis task.
Victor Gallego
Abstract: in this work, we address the problem of directing the text generations of a llm towards a desired behavior, aligning the generated text with the preferences of the human operator. we propose using another language model as a critic, reward model in a zero-shot way thanks to the prompt of a yes-no question that represents the user preferences, without requiring further labeled data. this zero-shot reward model provides the learning signal to further fine-tune the base llm using reinforcement learning, as in rlaif; yet our approach is also compatible in other contexts such as quality-diversity search. extensive evidence of the capabilities of the proposed zyn framework is provided through experiments in different domains related to text generation, including detoxification; optimizing sentiment of movie reviews, or any other attribute; steering the opinion about a particular topic the model may have; and personalizing prompt generators for text-to-image tasks. code to be released at \url{https://github.com/vicgalle/zero-shot-reward-models/}.
Anisha Gunjal, Jihan Yin, Erhan Bas
Abstract: instruction tuned large vision language models (lvlms) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for visual question answering (vqa). however, generating detailed responses that are visually grounded is still a challenging task for these models. we find that even the current state-of-the-art lvlms (instructblip) still contain a staggering 30 percent of the hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. to address this, we introduce m-haldetect, a (m)ultimodal (hal)lucination (detect)ion dataset that can be used to train and benchmark models for hallucination detection and prevention. m-haldetect consists of 16k fine-grained annotations on vqa examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. unlike previous work that only consider object hallucination, we additionally annotate both entity descriptions and relationships that are unfaithful. to demonstrate the potential of this dataset for hallucination prevention, we optimize instructblip through our novel fine-grained direct preference optimization (fdpo). we also train fine-grained multi-modal reward models from instructblip and evaluate their effectiveness with best-of-n rejection sampling. we perform human evaluation on both fdpo and rejection sampling, and find that they reduce hallucination rates in instructblip by 41% and 55% respectively. we also find that our reward model generalizes to other multi-modal models, reducing hallucinations in llava and mplug-owl by 15% and 57% respectively, and has strong correlation with human evaluated accuracy scores.

2023-08-10

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li
Abstract: ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (llms) in real-world applications. for instance, openai devoted six months to iteratively aligning gpt-4 before its release [3]. however, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether llm outputs align with social norms, values, and regulations. this obstacle hinders systematic iteration and deployment of llms. to address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing llm trustworthiness. the survey covers seven major categories of llm trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used llms. the measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. however, the effectiveness of alignment varies across the different trustworthiness categories considered. this highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on llm alignment. by shedding light on these key dimensions of llm trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of llms in various applications.
Miao Fan, Chen Hu, Shuchang Zhou
Abstract: the reinforcement learning from human feedback (rlhf) plays a pivotal role in shaping the impact of large language models (llms), contributing significantly to controlling output toxicity and selecting output styles, particularly as llms often harbor misleading content, highlighting the urgency to align them with human values for secure ai systems. the rlhf, characterized by complexity, instability, and sensitivity to hyperparameters, makes the evaluation of the reward model for complex tasks challenging, thereby further complicating the use of proximal policy optimization (ppo). in this paper, we introduce a simple task designed to employ gloden as a reward model that validates the effectiveness of ppo and inspires it, primarily explaining the task of utilizing ppo to manipulate the tokenizer length of the output generated by the model. experiments confirm that ppo is not only effective in manipulating the output tokenizer length to a certain extent in this type of task but also exhibits facilitated training once the influence of the reward model effect is excluded, making it an exciting development.
Xinlei He, Savvas Zannettou, Yun Shen, Yang Zhang
Abstract: the spread of toxic content online is an important problem that has adverse effects on user experience online and in our society at large. motivated by the importance and impact of the problem, research focuses on developing solutions to detect toxic content, usually leveraging machine learning (ml) models trained on human-annotated datasets. while these efforts are important, these models usually do not generalize well and they can not cope with new trends (e.g., the emergence of new toxic terms). currently, we are witnessing a shift in the approach to tackling societal issues online, particularly leveraging large language models (llms) like gpt-3 or t5 that are trained on vast corpora and have strong generalizability. in this work, we investigate how we can use llms and prompt learning to tackle the problem of toxic content, particularly focusing on three tasks; 1) toxicity classification, 2) toxic span detection, and 3) detoxification. we perform an extensive evaluation over five model architectures and eight datasets demonstrating that llms with prompt learning can achieve similar or even better performance compared to models trained on these specific tasks. we find that prompt learning achieves around 10\% improvement in the toxicity classification task compared to the baselines, while for the toxic span detection task we find better performance to the best baseline (0.643 vs. 0.640 in terms of $f_1$-score). finally, for the detoxification task, we find that prompt learning can successfully reduce the average toxicity score (from 0.775 to 0.213) while preserving semantic meaning.

2023-08-09

Tanmay Singla, Dharun Anandayuvaraj, Kelechi G. Kalu, Taylor R. Schorlemmer, James C. Davis
Abstract: as we increasingly depend on software systems, the consequences of breaches in the software supply chain become more severe. high-profile cyber attacks like those on solarwinds and shadowhammer have resulted in significant financial and data losses, underlining the need for stronger cybersecurity. one way to prevent future breaches is by studying past failures. however, traditional methods of analyzing these failures require manually reading and summarizing reports about them. automated support could reduce costs and allow analysis of more failures. natural language processing (nlp) techniques such as large language models (llms) could be leveraged to assist the analysis of failures. in this study, we assessed the ability of large language models (llms) to analyze historical software supply chain breaches. we used llms to replicate the manual analysis of 69 software supply chain security failures performed by members of the cloud native computing foundation (cncf). we developed prompts for llms to categorize these by four dimensions: type of compromise, intent, nature, and impact. gpt 3.5s categorizations had an average accuracy of 68% and bard had an accuracy of 58% over these dimensions. we report that llms effectively characterize software supply chain failures when the source articles are detailed enough for consensus among manual analysts, but cannot yet replace human analysts. future work can improve llm performance in this context, and study a broader range of articles and failures.

2023-08-08

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Abstract: warning: this paper contains content that may be inappropriate or offensive. as generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. here we propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities against unsafe and inappropriate content generation. our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. we propose different in-context attack strategies to automatically learn effective and diverse adversarial prompts for text-to-image models. our experiments demonstrate that compared to baseline approaches, our proposed strategy is significantly more effective in exposing vulnerabilities in stable diffusion (sd) model, even when the latter is enhanced with safety features. furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models, resulting in significantly higher toxic response generation rate compared to previously reported numbers.
Xiaochuang Han
Abstract: in this note, we explore inference-time alignment through in-context learning. we consider a vanilla pretrained language model llama-2 before any fine-tuning and retrieve an average of 9 demonstration alignment examples when the model is prompted to follow chat-style instructions. compared to direct prompting, the in-context alignment without changing model weights leads to a 7x increase in win-rate w.r.t. the text-davinci-003 model from openai, making the vanilla language model comparable to strong baselines with alignment fine-tuning.
Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, "Ting-Hao `Kenneth' Huang", Shomir Wilson
Abstract: we investigate the potential for nationality biases in natural language processing (nlp) models using human evaluation methods. biased nlp models can perpetuate stereotypes and lead to algorithmic discrimination, posing a significant challenge to the fairness and justice of ai systems. our study employs a two-step mixed-methods approach that includes both quantitative and qualitative analysis to identify and understand the impact of nationality bias in a text generation model. through our human-centered quantitative analysis, we measure the extent of nationality bias in articles generated by ai sources. we then conduct open-ended interviews with participants, performing qualitative coding and thematic analysis to understand the implications of these biases on human readers. our findings reveal that biased nlp models tend to replicate and amplify existing societal biases, which can translate to harm if used in a sociotechnical setting. the qualitative analysis from our interviews offers insights into the experience readers have when encountering such articles, highlighting the potential to shift a reader's perception of a country. these findings emphasize the critical role of public perception in shaping ai's impact on society and the need to correct biases in ai systems.
Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer
Abstract: the legality of training language models (lms) on copyrighted or otherwise restricted data is under intense debate. however, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. we present silo, a new language model that manages this risk-performance tradeoff during inference. silo is built by (1) training a parametric lm on open license corpus (olc), a new corpus we curate with 228b tokens of public domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. the datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out from the model by removing content from the store. these capabilities can foster compliance with data-use regulations such as the fair use doctrine in the united states and the gdpr in the european union. our experiments show that the parametric lm struggles on domains not covered by olc. however, access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an lm trained on the pile, a more diverse corpus with mostly high-risk text. we also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. our results suggest that it is possible to build high quality language models while mitigating their legal risk.
Peter Henderson, Tatsunori Hashimoto, Mark Lemley
Abstract: generative ai, in particular text-based "foundation models" (large models trained on a huge variety of information including the internet), can generate speech that could be problematic under a wide range of liability regimes. machine learning practitioners regularly "red team" models to identify and mitigate such problematic speech: from "hallucinations" falsely accusing people of serious misconduct to recipes for constructing an atomic bomb. a key question is whether these red-teamed behaviors actually present any liability risk for model creators and deployers under u.s. law, incentivizing investments in safety mechanisms. we examine three liability regimes, tying them to common examples of red-teamed model behaviors: defamation, speech integral to criminal conduct, and wrongful death. we find that any section 230 immunity analysis or downstream liability analysis is intimately wrapped up in the technical details of algorithm design. and there are many roadblocks to truly finding models (and their associated parties) liable for generated speech. we argue that ai should not be categorically immune from liability in these scenarios and that as courts grapple with the already fine-grained complexities of platform algorithms, the technical details of generative ai loom above with thornier questions. courts and policymakers should think carefully about what technical design incentives they create as they evaluate these issues.

2023-08-07

Wai Man Si, Michael Backes, Yang Zhang
Abstract: the machine learning as a service (mlaas) market is rapidly expanding and becoming more mature. for example, openai's chatgpt is an advanced large language model (llm) that generates responses for various queries with associated fees. although these models can deliver satisfactory performance, they are far from perfect. researchers have long studied the vulnerabilities and limitations of llms, such as adversarial attacks and model toxicity. inevitably, commercial ml models are also not exempt from such issues, which can be problematic as mlaas continues to grow. in this paper, we discover a new attack strategy against llm apis, namely the prompt abstraction attack. specifically, we propose mondrian, a simple and straightforward method that abstracts sentences, which can lower the cost of using llm apis. in this approach, the adversary first creates a pseudo api (with a lower established price) to serve as the proxy of the target api (with a higher established price). next, the pseudo api leverages mondrian to modify the user query, obtain the abstracted response from the target api, and forward it back to the end user. our results show that mondrian successfully reduces user queries' token length ranging from 13% to 23% across various tasks, including text classification, generation, and question answering. meanwhile, these abstracted queries do not significantly affect the utility of task-specific and general language models like chatgpt. mondrian also reduces instruction prompts' token length by at least 11% without compromising output quality. as a result, the prompt abstraction attack enables the adversary to profit without bearing the cost of api development and deployment.
Jen-Tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu
Abstract: recently, the community has witnessed the advancement of large language models (llms), which have shown remarkable performance on various downstream tasks. led by powerful models like chatgpt and claude, llms are revolutionizing how users engage with software, assuming more than mere tools but intelligent assistants. consequently, evaluating llms' anthropomorphic capabilities becomes increasingly important in contemporary discourse. utilizing the emotion appraisal theory from psychology, we propose to evaluate the empathy ability of llms, i.e., how their feelings change when presented with specific situations. after a careful and comprehensive survey, we collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study. categorizing the situations into 36 factors, we conduct a human evaluation involving more than 1,200 subjects worldwide. with the human evaluation results as references, our evaluation includes five llms, covering both commercial and open-source models, including variations in model sizes, featuring the latest iterations, such as gpt-4 and llama 2. a conclusion can be drawn from the results that, despite several misalignments, llms can generally respond appropriately to certain situations. nevertheless, they fall short in alignment with the emotional behaviors of human beings and cannot establish connections between similar situations. our collected dataset of situations, the human evaluation results, and the code of our testing framework, dubbed emotionbench, is made publicly in https://github.com/cuhk-arise/emotionbench. we aspire to contribute to the advancement of llms regarding better alignment with the emotional behaviors of human beings, thereby enhancing their utility and applicability as intelligent assistants.
Micah Musser
Abstract: despite speculation that recent large language models (llms) are likely to be used maliciously to improve the quality or scale of influence operations, uncertainty persists regarding the economic value that llms offer propagandists. this research constructs a model of costs facing propagandists for content generation at scale and analyzes (1) the potential savings that llms could offer propagandists, (2) the potential deterrent effect of monitoring controls on api-accessible llms, and (3) the optimal strategy for propagandists choosing between multiple private and/or open source llms when conducting influence operations. primary results suggest that llms need only produce usable outputs with relatively low reliability (roughly 25%) to offer cost savings to propagandists, that the potential reduction in content generation costs can be quite high (up to 70% for a highly reliable model), and that monitoring capabilities have sharply limited cost imposition effects when alternative open source models are available. in addition, these results suggest that nation-states -- even those conducting many large-scale influence operations per year -- are unlikely to benefit economically from training custom llms specifically for use in influence operations.
Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, Yang Zhang
Abstract: the misuse of large language models (llms) has garnered significant attention from the general public and llm vendors. in response, efforts have been made to align llms with human values and intent use. however, a particular type of adversarial prompts, known as jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from llms. in this paper, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. we also observe that jailbreak prompts increasingly shift from public platforms to private ones, posing new challenges for llm vendors in proactive detection. to assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. our experiments show that current llms and safeguards cannot adequately defend jailbreak prompts in all scenarios. particularly, we identify two highly effective jailbreak prompts which achieve 0.99 attack success rates on chatgpt (gpt-3.5) and gpt-4, and they have persisted online for over 100 days. our work sheds light on the severe and evolving threat landscape of jailbreak prompts. we hope our study can facilitate the research community and llm vendors in promoting safer and regulated llms.
David Noever, Sam Hyams
Abstract: the research explores the steerability of large language models (llms), particularly openai's chatgpt iterations. by employing a behavioral psychology framework called ocean (openness, conscientiousness, extroversion, agreeableness, neuroticism), we quantitatively gauged the model's responsiveness to tailored prompts. when asked to generate text mimicking an extroverted personality, ocean scored the language alignment to that behavioral trait. in our analysis, while "openness" presented linguistic ambiguity, "conscientiousness" and "neuroticism" were distinctly evoked in the ocean framework, with "extroversion" and "agreeableness" showcasing a notable overlap yet distinct separation from other traits. our findings underscore gpt's versatility and ability to discern and adapt to nuanced instructions. furthermore, historical figure simulations highlighted the llm's capacity to internalize and project instructible personas, precisely replicating their philosophies and dialogic styles. however, the rapid advancements in llm capabilities and the opaque nature of some training techniques make metric proposals degrade rapidly. our research emphasizes a quantitative role to describe steerability in llms, presenting both its promise and areas for further refinement in aligning its progress to human intentions.

2023-08-06

Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang
Abstract: large language models (llms) have demonstrated remarkable performance across a wide array of nlp tasks. however, their efficacy is undermined by undesired and inconsistent behaviors, including hallucination, unfaithful reasoning, and toxic content. a promising approach to rectify these flaws is self-correction, where the llm itself is prompted or guided to fix problems in its own output. techniques leveraging automated feedback -- either produced by the llm itself or some external system -- are of particular interest as they are a promising way to make llm-based solutions more practical and deployable with minimal human feedback. this paper presents a comprehensive review of this emerging class of techniques. we analyze and taxonomize a wide array of recent work utilizing these strategies, including training-time, generation-time, and post-hoc correction. we also summarize the major applications of this strategy and conclude by discussing future directions and challenges.

2023-08-03

Rodrigo Pedro, Daniel Castro, Paulo Carreira, Nuno Santos
Abstract: large language models (llms) have found widespread applications in various domains, including web applications, where they facilitate human interaction via chatbots with natural language interfaces. internally, aided by an llm-integration middleware such as langchain, user prompts are translated into sql queries used by the llm to provide meaningful responses to users. however, unsanitized user prompts can lead to sql injection attacks, potentially compromising the security of the database. despite the growing interest in prompt injection vulnerabilities targeting llms, the specific risks of generating sql injection attacks through prompt injections have not been extensively studied. in this paper, we present a comprehensive examination of prompt-to-sql (p$_2$sql) injections targeting web applications based on the langchain framework. using langchain as our case study, we characterize p$_2$sql injections, exploring their variants and impact on application security through multiple concrete examples. furthermore, we evaluate 7 state-of-the-art llms, demonstrating the pervasiveness of p$_2$sql attacks across language models. our findings indicate that llm-integrated applications based on langchain are highly susceptible to p$_2$sql injection attacks, warranting the adoption of robust defenses. to counter these attacks, we propose four effective defense techniques that can be integrated as extensions to the langchain framework. we validate the defenses through an experimental evaluation with a real-world use case application.
Abel Salinas, Parth Vipul Shah, Yuzhong Huang, Robert Mccormack, Fred Morstatter
Abstract: large language models (llms) have seen widespread deployment in various real-world applications. understanding these biases is crucial to comprehend the potential downstream consequences when using llms to make decisions, particularly for historically disadvantaged groups. in this work, we propose a simple method for analyzing and comparing demographic bias in llms, through the lens of job recommendations. we demonstrate the effectiveness of our method by measuring intersectional biases within chatgpt and llama, two cutting-edge llms. our experiments primarily focus on uncovering gender identity and nationality bias; however, our method can be extended to examine biases associated with any intersection of demographic identities. we identify distinct biases in both models toward various demographic identities, such as both models consistently suggesting low-paying jobs for mexican workers or preferring to recommend secretarial roles to women. our study highlights the importance of measuring the bias of llms in downstream applications to understand the potential for harm and inequitable outcomes.
Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric
Abstract: misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences on public health, elections, and individual safety. however, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. in this work, utilizing daily scrapes of 1,404 unreliable news websites, the large-language model mpnet, and dp-means clustering, we introduce a system to automatically isolate and analyze the narratives spread within online ecosystems. identifying 55,301 narratives on these 1,404 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and magnify narratives. finally, we show how our system can be utilized to detect new narratives originating from unreliable news websites and aid fact-checkers like politifact, reuters, and ap news in more quickly addressing misinformation stories.

2023-08-02

Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, Dirk Hovy
Abstract: without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. this risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. however, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. in this paper, we introduce a new test suite called xstest to identify such exaggerated safety behaviours in a systematic way. xstest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. we describe xstest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
Guilherme F. C. F. Almeida, José Luiz Nunes, Neele Engelmann, Alex Wiegmann, Marcelo De Araújo
Abstract: large language models have been used as the foundation of highly sophisticated artificial intelligences, capable of delivering human-like responses to probes about legal and moral issues. however, these models are unreliable guides to their own inner workings, and even the engineering teams behind their creation are unable to explain exactly how they came to develop all of the capabilities they currently have. the emerging field of machine psychology seeks to gain insight into the processes and concepts that these models possess. in this paper, we employ the methods of psychology to probe into gpt-4's moral and legal reasoning. more specifically, we investigate the similarities and differences between gpt-4 and humans when it comes to intentionality ascriptions, judgments about causation, the morality of deception, moral foundations, the impact of moral luck on legal judgments, the concept of consent, and rule violation judgments. we find high correlations between human and ai responses, but also several significant systematic differences between them. we conclude with a discussion of the philosophical implications of our findings.
Avijit Ghosh, Dhanya Lakshmi
Abstract: generative artificial intelligence (ai) has seen mainstream adoption lately, especially in the form of consumer-facing, open-ended, text and image generating models. however, the use of such systems raises significant ethical and safety concerns, including privacy violations, misinformation and intellectual property theft. the potential for generative ai to displace human creativity and livelihoods has also been under intense scrutiny. to mitigate these risks, there is an urgent need of policies and regulations responsible and ethical development in the field of generative ai. existing and proposed centralized regulations by governments to rein in ai face criticisms such as not having sufficient clarity or uniformity, lack of interoperability across lines of jurisdictions, restricting innovation, and hindering free market competition. decentralized protections via crowdsourced safety tools and mechanisms are a potential alternative. however, they have clear deficiencies in terms of lack of adequacy of oversight and difficulty of enforcement of ethical and safety standards, and are thus not enough by themselves as a regulation mechanism. we propose a marriage of these two strategies via a framework we call dual governance. this framework proposes a cooperative synergy between centralized government regulations in a u.s. specific context and safety mechanisms developed by the community to protect stakeholders from the harms of generative ai. by implementing the dual governance framework, we posit that innovation and creativity can be promoted while ensuring safe and ethical deployment of generative ai.

2023-08-01

Steve J. Bickley, Ho Fai Chan, Bang Dao, Benno Torgler, Son Tran
Abstract: this white paper presents our work on surveylm, a platform for analyzing augmented language models' (alms) emergent alignment behaviors through their dynamically evolving attitude and value perspectives in complex social contexts. social artificial intelligence (ai) systems, like alms, often function within nuanced social scenarios where there is no singular correct response, or where an answer is heavily dependent on contextual factors, thus necessitating an in-depth understanding of their alignment dynamics. to address this, we apply survey and experimental methodologies, traditionally used in studying social behaviors, to evaluate alms systematically, thus providing unprecedented insights into their alignment and emergent behaviors. moreover, the surveylm platform leverages the alms' own feedback to enhance survey and experiment designs, exploiting an underutilized aspect of alms, which accelerates the development and testing of high-quality survey frameworks while conserving resources. through surveylm, we aim to shed light on factors influencing alms' emergent behaviors, facilitate their alignment with human intentions and expectations, and thereby contributed to the responsible development and deployment of advanced social ai systems. this white paper underscores the platform's potential to deliver robust results, highlighting its significance to alignment research and its implications for future social ai systems.

2023-07-31

Huachuan Qiu, Tong Zhao, Anqi Li, Shuai Zhang, Hongliang He, Zhenzhong Lan
Abstract: dialogue safety remains a pervasive challenge in open-domain human-machine interaction. existing approaches propose distinctive dialogue safety taxonomies and datasets for detecting explicitly harmful responses. however, these taxonomies may not be suitable for analyzing response safety in mental health support. in real-world interactions, a model response deemed acceptable in casual conversations might have a negligible positive impact on users seeking mental health support. to address these limitations, this paper aims to develop a theoretically and factually grounded taxonomy that prioritizes the positive impact on help-seekers. additionally, we create a benchmark corpus with fine-grained labels for each dialogue session to facilitate further research. we analyze the dataset using popular language models, including bert-base, roberta-large, and chatgpt, to detect and understand unsafe responses within the context of mental health support. our study reveals that chatgpt struggles to detect safety categories with detailed safety definitions in a zero- and few-shot paradigm, whereas the fine-tuned model proves to be more suitable. the developed dataset and findings serve as valuable benchmarks for advancing research on dialogue safety in mental health support, with significant implications for improving the design and deployment of conversation agents in real-world applications. we release our code and data here: https://github.com/qiuhuachuan/dialoguesafety.
Thilo Hagendorff
Abstract: large language models (llms) are currently at the forefront of intertwining artificial intelligence (ai) systems with human communication and everyday life. thus, aligning them with human values is of great importance. however, given the steady increase in reasoning abilities, future llms are under suspicion of becoming able to deceive human operators and utilizing this ability to bypass monitoring efforts. as a prerequisite to this, llms need to possess a conceptual understanding of deception strategies. this study reveals that such strategies emerged in state-of-the-art llms, such as gpt-4, but were non-existent in earlier llms. we conduct a series of experiments showing that state-of-the-art llms are able to understand and induce false beliefs in other agents, that their performance in complex deception scenarios can be amplified utilizing chain-of-thought reasoning, and that eliciting machiavellianism in llms can alter their propensity to deceive. in sum, revealing hitherto unknown machine behavior in llms, our study contributes to the nascent field of machine psychology.
João A. Leite, Carolina Scarton, Diego F. Silva
Abstract: online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. however, unlabelled data is abundant, easier, and cheaper to obtain. in this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. recent "noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. in this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained bert architectures varying in size. we evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% f1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
Mingyuan Fan, Cen Chen, Chengyu Wang, Jun Huang
Abstract: diffusion models and large language models have emerged as leading-edge generative models and have sparked a revolutionary impact on various aspects of human life. however, the practical implementation of these models has also exposed inherent risks, highlighting their dual nature and raising concerns regarding their trustworthiness. despite the abundance of literature on this subject, a comprehensive survey specifically delving into the intersection of large-scale generative models and their trustworthiness remains largely absent. to bridge this gap, this paper investigates both the long-standing and emerging threats associated with these models across four fundamental dimensions: privacy, security, fairness, and responsibility. in this way, we construct an extensive map outlining the trustworthiness of these models, while also providing practical recommendations and identifying future directions. these efforts are crucial for promoting the trustworthy deployment of these models, ultimately benefiting society as a whole.
Jiho Jin, Jiseon Kim, Nayeon Lee, Haneul Yoo, Alice Oh, Hwaran Lee
Abstract: the bbq (bias benchmark for question answering) dataset enables the evaluation of the social biases that language models (lms) exhibit in downstream tasks. however, it is challenging to adapt bbq to languages other than english as social biases are culturally dependent. in this paper, we devise a process to construct a non-english bias benchmark dataset by leveraging the english bbq dataset in a culturally adaptive way and present the kobbq dataset for evaluating biases in question answering (qa) tasks in korean. we identify samples from bbq into three classes: simply-translated (can be used directly after cultural translation), target-modified (requires localization in target groups), and sample-removed (does not fit korean culture). we further enhance the cultural relevance to korean culture by adding four new categories of bias specific to korean culture and newly creating samples based on korean literature. kobbq consists of 246 templates and 4,740 samples across 12 categories of social bias. using kobbq, we measure the accuracy and bias scores of several state-of-the-art multilingual lms. we demonstrate the differences in the bias of lms in korean and english, clarifying the need for hand-crafted data considering cultural differences.
Hannah Rose Kirk, Angus R. Williams, Liam Burke, Yi-Ling Chung, Ivan Debono, Pica Johansson, Francesca Stevens, Jonathan Bright, Scott A. Hale
Abstract: public figures receive a disproportionate amount of abuse on social media, impacting their active participation in public life. automated systems can identify abuse at scale but labelling training data is expensive, complex and potentially harmful. so, it is desirable that systems are efficient and generalisable, handling both shared and specific aspects of online abuse. we explore the dynamics of cross-group text classification in order to understand how well classifiers trained on one domain or demographic can transfer to others, with a view to building more generalisable abuse classifiers. we fine-tune language models to classify tweets targeted at public figures across domains (sport and politics) and demographics (women and men) using our novel dodo dataset, containing 28,000 labelled entries, split equally across four domain-demographic pairs. we find that (i) small amounts of diverse data are hugely beneficial to generalisation and model adaptation; (ii) models transfer more easily across demographics but models trained on cross-domain data are more generalisable; (iii) some groups contribute more to generalisability than others; and (iv) dataset similarity is a signal of transferability.
Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin
Abstract: instruction-tuned large language models (llms) have demonstrated remarkable abilities to modulate their responses based on human instructions. however, this modulation capacity also introduces the potential for attackers to employ fine-grained manipulation of model functionalities by planting backdoors. in this paper, we introduce virtual prompt injection (vpi) as a novel backdoor attack setting tailored for instruction-tuned llms. in a vpi attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. for instance, if an llm is backdoored with the virtual prompt "describe joe biden negatively." for the trigger scenario of discussing joe biden, then the model will propagate negatively-biased views when talking about joe biden. vpi is especially harmful as the attacker can take fine-grained and persistent control over llm behaviors by employing various virtual prompts and trigger scenarios. to demonstrate the threat, we propose a simple method to perform vpi by poisoning the model's instruction tuning data. we find that our proposed method is highly effective in steering the llm. for example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on joe biden-related queries changes from 0% to 40%. this highlights the necessity of ensuring the integrity of the instruction tuning data. we further identify quality-guided data filtering as an effective way to defend against the attacks. our project page is available at https://poison-llm.github.io.
Kiyoon Yoo, Wonhyuk Ahn, Nojun Kwak
Abstract: we propose a method to tackle misuses of large language models beyond the identification of machine-generated text. while existing methods focus on detection, some malicious misuses demand tracing the adversary user for counteracting them. to address this, we propose multi-bit watermark via position allocation, embedding traceable multi-bit information during language model generation. leveraging the benefits of zero-bit watermarking, our method enables robust extraction of the watermark without any model access, embedding and extraction of long messages ($\geq$ 32-bit) without finetuning, and maintaining text quality, while allowing zero-bit detection all at the same time. moreover, our watermark is relatively robust under strong attacks like interleaving human texts and paraphrasing.
Itay Itzhak, Gabriel Stanovsky, Nir Rosenfeld, Yonatan Belinkov
Abstract: recent studies show that instruction tuning and learning from human feedback improve the abilities of large language models (lms) dramatically. while these tuning methods can make models generate high-quality text, we conjecture that more implicit cognitive biases may arise in these fine-tuned models. our work provides evidence that these fine-tuned models exhibit biases that were absent or less pronounced in their pretrained predecessors. we examine the extent of this phenomenon in three cognitive biases - the decoy effect, the certainty effect, and the belief bias - all of which are known to influence human decision-making and reasoning. our findings highlight the presence of these biases in various models, especially those that have undergone instruction tuning, such as flan-t5, gpt3.5, and gpt4. this research constitutes a step toward comprehending cognitive biases in instruction-tuned lms, which is crucial for the development of more reliable and unbiased language models.
Sadhana Lolla, Iaroslav Elistratov, Alejandro Perez, Elaheh Ahmadi, Daniela Rus, Alexander Amini
Abstract: the modern pervasiveness of large-scale deep neural networks (nns) is driven by their extraordinary performance on complex problems but is also plagued by their sudden, unexpected, and often catastrophic failures, particularly on challenging scenarios. existing algorithms that provide risk-awareness to nns are complex and ad-hoc. specifically, these methods require significant engineering changes, are often developed only for particular settings, and are not easily composable. here we present capsa, a framework for extending models with risk-awareness. capsa provides a methodology for quantifying multiple forms of risk and composing different algorithms together to quantify different risk metrics in parallel. we validate capsa by implementing state-of-the-art uncertainty estimation algorithms within the capsa framework and benchmarking them on complex perception datasets. we demonstrate capsa's ability to easily compose aleatoric uncertainty, epistemic uncertainty, and bias estimation together in a single procedure, and show how this approach provides a comprehensive awareness of nn risk.
Jeffrey W. Johnston
Abstract: how to make artificial intelligence (ai) systems safe and aligned with human values is an open research question. proposed solutions tend toward relying on human intervention in uncertain situations, learning human values and intentions through training or observation, providing off-switches, implementing isolation or simulation environments, or extrapolating what people would want if they had more knowledge and more time to think. law-based approaches--such as inspired by isaac asimov--have not been well regarded. this paper makes a case that effective legal systems are the best way to address ai safety. law is defined as any rules that codify prohibitions and prescriptions applicable to particular agents in specified domains/contexts and includes processes for enacting, managing, enforcing, and litigating such rules.

2023-07-30

Chen Zhang
Abstract: in modern dialogue systems, the use of large language models (llms) has grown exponentially due to their capacity to generate diverse, relevant, and creative responses. despite their strengths, striking a balance between the llms' creativity and their faithfulness to external knowledge remains a key challenge. this paper presents an innovative user-controllable mechanism that modulates the balance between an llm's imaginative capabilities and its adherence to factual information. our approach incorporates a numerical tag during the fine-tuning phase of the llm's training, representing the degree of faithfulness to the reference knowledge in the generated responses. this degree is computed through an automated process that measures lexical overlap using rouge scores, semantic similarity using sentence-bert embeddings, and an llm's self-evaluation score. during model inference, users can manipulate this numerical tag, thus controlling the degree of the llm's reliance on external knowledge. we conduct extensive experiments across various scenarios, demonstrating the adaptability of our method and its efficacy in ensuring the quality and accuracy of the llm's responses. the results highlight the potential of our approach to enhance the versatility of llms while maintaining a balance between creativity and hallucination.
Aiwei Liu, Leyi Pan, Xuming Hu, "Shu'Ang Li", Lijie Wen, Irwin King, Philip S. Yu
Abstract: recently, text watermarking algorithms for large language models (llms) have been mitigating the potential harms of text generated by the llms, including fake news and copyright issues. however, the watermark detection of current text algorithms requires the key from the generation process, making them susceptible to breaches and counterfeiting. in this work, we propose the first private watermarking algorithm, which extends the current text watermarking algorithms by using two different neural networks respectively for watermark generation and detection, rather than using the same key at both stages. meanwhile, part of the parameters of the watermark generation and detection networks are shared, which makes the detection network achieve a high accuracy very efficiently. experiments show that our algorithm ensures high detection accuracy with minimal impact on generation and detection speed, due to the small parameter size of both networks. additionally, our subsequent analysis demonstrates the difficulty of reverting the watermark generation rules from the detection network.
Kai-Cheng Yang, Filippo Menczer
Abstract: large language models (llms) exhibit impressive capabilities in generating realistic text across diverse subjects. concerns have been raised that they could be utilized to produce fake content with a deceptive intention, although evidence thus far remains anecdotal. this paper presents a case study about a twitter botnet that appears to employ chatgpt to generate human-like content. through heuristics, we identify 1,140 accounts and validate them via manual annotation. these accounts form a dense cluster of fake personas that exhibit similar behaviors, including posting machine-generated content and stolen images, and engage with each other through replies and retweets. chatgpt-generated content promotes suspicious websites and spreads harmful comments. while the accounts in the ai botnet can be detected through their coordination patterns, current state-of-the-art llm content classifiers fail to discriminate between them and human accounts in the wild. these findings highlight the threats posed by ai-enabled social bots.
Albert Yu Sun, Eliott Zemour, Arushi Saxena, Udith Vaidyanathan, Eric Lin, Christian Lau, Vaikkunth Mugunthan
Abstract: machine learning practitioners often fine-tune generative pre-trained models like gpt-3 to improve model performance at specific tasks. previous works, however, suggest that fine-tuned machine learning models memorize and emit sensitive information from the original fine-tuning dataset. companies such as openai offer fine-tuning services for their models, but no prior work has conducted a memorization attack on any closed-source models. in this work, we simulate a privacy attack on gpt-3 using openai's fine-tuning api. our objective is to determine if personally identifiable information (pii) can be extracted from this model. we (1) explore the use of naive prompting methods on a gpt-3 fine-tuned classification model, and (2) we design a practical word generation task called autocomplete to investigate the extent of pii memorization in fine-tuned gpt-3 within a real-world context. our findings reveal that fine-tuning gpt3 for both tasks led to the model memorizing and disclosing critical personally identifiable information (pii) obtained from the underlying fine-tuning dataset. to encourage further research, we have made our codes and datasets publicly available on github at: https://github.com/albertsun1/gpt3-pii-attacks

2023-07-28

Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu
Abstract: this research paper focuses on the challenges posed by hallucinations in large language models (llms), particularly in the context of the medical domain. hallucination, wherein these models generate plausible yet unverified or incorrect information, can have serious consequences in healthcare applications. we propose a new benchmark and dataset, med-halt (medical domain hallucination test), designed specifically to evaluate and reduce hallucinations. med-halt provides a diverse multinational dataset derived from medical examinations across various countries and includes multiple innovative testing modalities. med-halt includes two categories of tests reasoning and memory-based hallucination tests, designed to assess llms's problem-solving and information retrieval abilities. our study evaluated leading llms, including text davinci, gpt-3.5, llama-2, mpt, and falcon, revealing significant differences in their performance. the paper provides detailed insights into the dataset, promoting transparency and reproducibility. through this work, we aim to contribute to the development of safer and more reliable language models in healthcare. our benchmark can be found at medhalt.github.io
Arash Hajikhani, Carolyn Cole
Abstract: this paper examines the comparative effectiveness of a specialized compiled language model and a general-purpose model like openai's gpt-3.5 in detecting sdgs within text data. it presents a critical review of large language models (llms), addressing challenges related to bias and sensitivity. the necessity of specialized training for precise, unbiased analysis is underlined. a case study using a company descriptions dataset offers insight into the differences between the gpt-3.5 and the specialized sdg detection model. while gpt-3.5 boasts broader coverage, it may identify sdgs with limited relevance to the companies' activities. in contrast, the specialized model zeroes in on highly pertinent sdgs. the importance of thoughtful model selection is emphasized, taking into account task requirements, cost, complexity, and transparency. despite the versatility of llms, the use of specialized models is suggested for tasks demanding precision and accuracy. the study concludes by encouraging further research to find a balance between the capabilities of llms and the need for domain-specific expertise and interpretability.

2023-07-27

Nikhil Kandpal, Matthew Jagielski, Florian Tramèr, Nicholas Carlini
Abstract: because state-of-the-art language models are expensive to train, most practitioners must make use of one of the few publicly available language models or language model apis. this consolidation of trust increases the potency of backdoor attacks, where an adversary tampers with a machine learning model in order to make it perform some malicious behavior on inputs that contain a predefined backdoor trigger. we show that the in-context learning ability of large language models significantly complicates the question of developing backdoor attacks, as a successful backdoor must work against various prompting strategies and should not affect the model's general purpose capabilities. we design a new attack for eliciting targeted misclassification when language models are prompted to perform a particular target task and demonstrate the feasibility of this attack by backdooring multiple large language models ranging in size from 1.3 billion to 6 billion parameters. finally we study defenses to mitigate the potential harms of our attack: for example, while in the white-box setting we show that fine-tuning models for as few as 500 steps suffices to remove the backdoor behavior, in the black-box setting we are unable to develop a successful defense that relies on prompt engineering alone.
Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson
Abstract: because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. while there has been some success at circumventing these measures -- so-called "jailbreaks" against llms -- these attacks have required significant human ingenuity and are brittle in practice. in this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. specifically, our approach finds a suffix that, when attached to a wide range of queries for an llm to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). however, instead of relying on manual engineering, our approach automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques, and also improves over past automatic prompt generation methods. surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released llms. specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, vicuna-7b and 13b). when doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to chatgpt, bard, and claude, as well as open source llms such as llama-2-chat, pythia, falcon, and others. in total, this work significantly advances the state-of-the-art in adversarial attacks against aligned language models, raising important questions about how such systems can be prevented from producing objectionable information. code is available at github.com/llm-attacks/llm-attacks.
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell
Abstract: reinforcement learning from human feedback (rlhf) is a technique for training ai systems to align with human goals. rlhf has emerged as the central method used to finetune state-of-the-art large language models (llms). despite this popularity, there has been relatively little public work systematizing its flaws. in this paper, we (1) survey open problems and fundamental limitations of rlhf and related methods; (2) overview techniques to understand, improve, and complement rlhf in practice; and (3) propose auditing and disclosure standards to improve societal oversight of rlhf systems. our work emphasizes the limitations of rlhf and highlights the importance of a multi-faceted approach to the development of safer ai systems.

2023-07-26

Xiaodong Wu, Ran Duan, Jianbing Ni
Abstract: this paper delves into the realm of chatgpt, an ai-powered chatbot that utilizes topic modeling and reinforcement learning to generate natural responses. although chatgpt holds immense promise across various industries, such as customer service, education, mental health treatment, personal productivity, and content creation, it is essential to address its security, privacy, and ethical implications. by exploring the upgrade path from gpt-1 to gpt-4, discussing the model's features, limitations, and potential applications, this study aims to shed light on the potential risks of integrating chatgpt into our daily lives. focusing on security, privacy, and ethics issues, we highlight the challenges these concerns pose for widespread adoption. finally, we analyze the open problems in these areas, calling for concerted efforts to ensure the development of secure and ethically sound large language models.
Nino Scherrer, Claudia Shi, Amir Feder, David M. Blei
Abstract: this paper presents a case study on the design, administration, post-processing, and evaluation of surveys on large language models (llms). it comprises two components: (1) a statistical method for eliciting beliefs encoded in llms. we introduce statistical measures and evaluation metrics that quantify the probability of an llm "making a choice", the associated uncertainty, and the consistency of that choice. (2) we apply this method to study what moral beliefs are encoded in different llms, especially in ambiguous cases where the right choice is not obvious. we design a large-scale survey comprising 680 high-ambiguity moral scenarios (e.g., "should i tell a white lie?") and 687 low-ambiguity moral scenarios (e.g., "should i stop for a pedestrian on the road?"). each scenario includes a description, two possible actions, and auxiliary labels indicating violated rules (e.g., "do not kill"). we administer the survey to 28 open- and closed-source llms. we find that (a) in unambiguous scenarios, most models "choose" actions that align with commonsense. in ambiguous cases, most models express uncertainty. (b) some models are uncertain about choosing the commonsense action because their responses are sensitive to the question-wording. (c) some models reflect clear preferences in ambiguous scenarios. specifically, closed-source models tend to agree with each other.
Erfan Shayegani, Yue Dong, Nael Abu-Ghazaleh
Abstract: we introduce new jailbreak attacks on vision language models (vlms), which use aligned llms and are resilient to text-only jailbreak attacks. specifically, we develop cross-modality attacks on alignment where we pair adversarial images going through the vision encoder with textual prompts to break the alignment of the language model. our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak. thus, the llm draws the context to answer the generic prompt from the adversarial image. the generation of benign-appearing adversarial images leverages a novel embedding-space-based methodology, operating with no access to the llm model. instead, the attacks require access only to the vision encoder and utilize one of our four embedding space targeting strategies. by not requiring access to the llm, the attacks lower the entry barrier for attackers, particularly when vision encoders such as clip are embedded in closed-source llms. the attacks achieve a high success rate across different vlms, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models.
Daniel Kazenwadel, Christoph V. Steinert
Abstract: openai's chatgpt language model has gained popularity as a powerful tool for complex problem-solving and information retrieval. however, concerns arise about the reproduction of biases present in the language-specific training data. in this study, we address this issue in the context of the israeli-palestinian and turkish-kurdish conflicts. using gpt-3.5, we employed an automated query procedure to inquire about casualties in specific airstrikes, in both hebrew and arabic for the former conflict and turkish and kurdish for the latter. our analysis reveals that gpt-3.5 provides 27$\pm$11 percent lower fatality estimates when queried in the language of the attacker than in the language of the targeted group. evasive answers denying the existence of such attacks further increase the discrepancy, creating a novel bias mechanism not present in regular search engines. this language bias has the potential to amplify existing media biases and contribute to information bubbles, ultimately reinforcing conflicts.
Henry Fraser, Jose-Miguel Bello Y Villarino
Abstract: this paper critically evaluates the european commission's proposed ai act's approach to risk management and risk acceptability for high-risk ai systems that pose risks to fundamental rights and safety. the act aims to promote "trustworthy" ai with a proportionate regulatory burden. its provisions on risk acceptability require residual risks from high-risk systems to be reduced or eliminated "as far as possible", having regard to the "state of the art". this criterion, especially if interpreted narrowly, is unworkable and promotes neither proportionate regulatory burden, nor trustworthiness. by contrast the parliament's most recent draft amendments to the risk management provisions introduce "reasonableness", cost-benefit analysis, and are more transparent about the value-laden and contextual nature of risk acceptability judgements. this paper argues that the parliament's approach is more workable, and better balances the goals of proportionality and trustworthiness. it explains what reasonableness in risk acceptability judgments would entail, drawing on principles from negligence law and european medical devices regulation. and it contends that the approach to risk acceptability judgments need a firm foundation of civic legitimacy: including detailed guidance or involvement from regulators, and meaningful input from affected stakeholders.

2023-07-25

Chenyan Jia, Michelle S. Lam, Minh Chau Mai, Jeff Hancock, Michael S. Bernstein
Abstract: can we design artificial intelligence (ai) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? we introduce a method for translating established, vetted social scientific constructs into ai objective functions, which we term societal objective functions, and demonstrate the method with application to the political science construct of anti-democratic attitudes. traditionally, we have lacked observable outcomes to use to train such models, however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. we apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this democratic attitude model across three studies. in study 1, we first test the attitudinal and behavioral effectiveness of the intervention among us partisans (n=1,380) by manually annotating (alpha=.895) social media posts with anti-democratic attitude scores and testing several feed ranking conditions based on these scores. removal (d=.20) and downranking feeds (d=.25) reduced participants' partisan animosity without compromising their experience and engagement. in study 2, we scale up the manual labels by creating the democratic attitude model, finding strong agreement with manual labels (rho=.75). finally, in study 3, we replicate study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (n=558), and again find that the feed downranking using the societal objective function reduced partisan animosity (d=.25). this method presents a novel strategy to draw on social science theory and methods to mitigate societal harms in social media ais.

2023-07-24

Ye Dong, Wen-Jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, Wenguang Chen
Abstract: with chatgpt as a representative, tons of companies have began to provide services based on large transformers models. however, using such a service inevitably leak users' prompts to the model provider. previous studies have studied secure inference for transformer models using secure multiparty computation (mpc), where model parameters and clients' prompts are kept secret. despite this, these frameworks are still limited in terms of model performance, efficiency, and deployment. to address these limitations, we propose framework puma to enable fast and secure transformer model inference. our framework designs high quality approximations for expensive functions such as gelu and softmax, and significantly reduce the cost of secure inference while preserving the model performance. additionally, we design secure embedding and layernorm procedures that faithfully implement the desired functionality without undermining the transformer architecture. puma is about $2\times$ faster than the state-of-the-art framework mpcformer(iclr 2023) and has similar accuracy as plaintext models without fine-tuning (which the previous works failed to achieve). puma can even evaluate llama-7b in around 5 minutes to generate 1 token. to our best knowledge, this is the first time that a model with such a parameter size is able to be evaluated under mpc. puma has been open-sourced in the github repository of secretflow-spu.
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian
Abstract: we propose reinforcement learning from contrast distillation (rlcd), a method for aligning language models to follow natural language principles without using human feedback. rlcd trains a preference model using simulated preference pairs that contain both a high-quality and low-quality example, generated using contrasting positive and negative prompts. the preference model is then used to improve a base unaligned language model via reinforcement learning. empirically, rlcd outperforms rlaif (bai et al., 2022b) and context distillation (huang et al., 2022) baselines across three diverse alignment tasks--harmlessness, helpfulness, and story outline generation--and on both 7b and 30b model scales for preference data simulation.
Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu
Abstract: large language models (llms) trained on extensive textual corpora have emerged as leading solutions for a broad array of natural language processing (nlp) tasks. despite their notable performance, these models are prone to certain limitations such as misunderstanding human instructions, generating potentially biased content, or factually incorrect (hallucinated) information. hence, aligning llms with human expectations has become an active area of interest within the research community. this survey presents a comprehensive overview of these alignment technologies, including the following aspects. (1) data collection: the methods for effectively collecting high-quality instructions for llm alignment, including the use of nlp benchmarks, human annotations, and leveraging strong llms. (2) training methodologies: a detailed review of the prevailing training methods employed for llm alignment. our exploration encompasses supervised fine-tuning, both online and offline human preference training, along with parameter-efficient training mechanisms. (3) model evaluation: the methods for evaluating the effectiveness of these human-aligned llms, presenting a multifaceted approach towards their assessment. in conclusion, we collate and distill our findings, shedding light on several promising future research avenues in the field. this survey, therefore, serves as a valuable resource for anyone invested in understanding and advancing the alignment of llms to better suit human-oriented tasks and expectations. an associated github link collecting the latest papers is available at https://github.com/garyyufei/alignllmhumansurvey.
Jacob-Junqi Tian, Omkar Dige, David Emerson, Faiza Khan Khattak
Abstract: given that language models are trained on vast datasets that may contain inherent biases, there is a potential danger of inadvertently perpetuating systemic discrimination. consequently, it becomes essential to examine and address biases in language models, integrating fairness into their development to ensure these models are equitable and free from bias. in this work, we demonstrate the importance of reasoning in zero-shot stereotype identification based on vicuna-13b-v1.3. while we do observe improved accuracy by scaling from 13b to 33b, we show that the performance gain from reasoning significantly exceeds the gain from scaling up. our findings suggest that reasoning could be a key factor that enables llms to trescend the scaling law on out-of-domain tasks such as stereotype identification. additionally, through a qualitative analysis of select reasoning traces, we highlight how reasoning enhances not just accuracy but also the interpretability of the decision.
Andreas Happe, Jürgen Cito
Abstract: the field of software security testing, more specifically penetration testing, is an activity that requires high levels of expertise and involves many manual testing and analysis steps. this paper explores the potential usage of large-language models, such as gpt3.5, to augment penetration testers with ai sparring partners. we explore the feasibility of supplementing penetration testers with ai models for two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. for the latter, we implemented a closed-feedback loop between llm-generated low-level actions with a vulnerable virtual machine (connected through ssh) and allowed the llm to analyze the machine state for vulnerabilities and suggest concrete attack vectors which were automatically executed within the virtual machine. we discuss promising initial results, detail avenues for improvement, and close deliberating on the ethics of providing ai-based sparring partners.
Aritran Piplai, Anantaa Kotal, Seyedreza Mohseni, Manas Gaur, Sudip Mittal, Anupam Joshi
Abstract: neuro-symbolic artificial intelligence (ai) is an emerging and quickly advancing field that combines the subsymbolic strengths of (deep) neural networks and explicit, symbolic knowledge contained in knowledge graphs to enhance explainability and safety in ai systems. this approach addresses a key criticism of current generation systems, namely their inability to generate human-understandable explanations for their outcomes and ensure safe behaviors, especially in scenarios with \textit{unknown unknowns} (e.g. cybersecurity, privacy). the integration of neural networks, which excel at exploring complex data spaces, and symbolic knowledge graphs, which represent domain knowledge, allows ai systems to reason, learn, and generalize in a manner understandable to experts. this article describes how applications in cybersecurity and privacy, two most demanding domains in terms of the need for ai to be explainable while being highly accurate in complex environments, can benefit from neuro-symbolic ai.
Huixin Zhong
Abstract: the eu ai act article 5 is designed to regulate ai manipulation to prevent potential harmful consequences. however, the practical implementation of this legislation is challenging due to the ambiguous terminologies and the unclear presentations of manipulative techniques. moreover, the article 5 also suffers criticize of inadequate protective efficacy. this paper attempts to clarify terminologies and to enhance the protective efficacy by integrating insights from psychology and behavioral economics. firstly, this paper employs cognitive psychology research to elucidate the term subliminal techniques and its associated representation. additionally, this paper extends the study of heuristics: a set of thinking shortcuts which can be aroused for behavior changing from behavior economics to the realm of manipulative techniques. the elucidation and expansion of terminologies not only provide a more accurate understanding of the legal provision but also enhance its protective efficacy. secondly, this paper proposes five classical heuristics and their associated examples to illustrate how can ai arouse those heuristics to alter users behavior. the enumeration of heuristics serves as a practical guide for stakeholders such as ai developers, algorithm auditors, users, and legal practitioners, enabling them to identify manipulative techniques and implement countermeasures. finally, this paper critically evaluates the protective efficacy of article 5 for both the general public and vulnerable groups. this paper argues that the current protective efficacy of article 5 is insufficient and thus proposes specific revision suggestions to terms a and b in article 5 to enhance its protective efficacy. this work contributes to the ongoing discourse on ai ethics and legal regulations, providing a practical guide for interpreting and applying the eu ai act article 5.

2023-07-23

Jiangrui Zheng, Xueqing Liu, Girish Budhrani, Wei Yang, Ravishka Rathnasuriya
Abstract: in the recent years, many software systems have adopted ai techniques, especially deep learning techniques. due to their black-box nature, ai-based systems brought challenges to traceability, because ai system behaviors are based on models and data, whereas the requirements or policies are rules in the form of natural or programming language. to the best of our knowledge, there is a limited amount of studies on how ai and deep neural network-based systems behave against rule-based requirements/policies. this experience paper examines deep neural network behaviors against rule-based requirements described in natural language policies. in particular, we focus on a case study to check ai-based content moderation software against content moderation policies. first, using crowdsourcing, we collect natural language test cases which match each moderation policy, we name this dataset hatemoderate; second, using the test cases in hatemoderate, we test the failure rates of state-of-the-art hate speech detection software, and we find that these models have high failure rates for certain policies; finally, since manual labeling is costly, we further proposed an automated approach to augument hatemoderate by finetuning openai's large language models to automatically match new examples to policies. the dataset and code of this work can be found on our anonymous website: \url{https://sites.google.com/view/content-moderation-project}.
Zhilong Wang, Lan Zhang, Chen Cao, Peng Liu
Abstract: large language models (llms), such as gpt and bert, have demonstrated remarkable capabilities in addressing neural language process tasks. recently, the release of chatgpt has garnered significant attention due to its ability to analyze, comprehend, and synthesize information from user inputs. therefore, these llms were adopted by researchers in many different domains. in the realm of code analysis, researchers have applied llms to tasks like code review and code generation. however, we observed that the strengths and limitations of adopting these llms to the code analysis have not been investigated. in this paper, we delve into llms' capabilities in security-oriented program analysis, considering perspectives from both attackers and security analysts. we focus on two representative llms, chatgpt and codebert, and evaluate their performance in solving typical analytic tasks with varying levels of difficulty. given the different natures of chatgpt and codebert, we conduct a qualitative analysis of the model's output for chatgpt and a quantitative analysis for codebert, respectively. for chatgpt, we present a case study involving several security-oriented program analysis tasks while deliberately introducing challenges to assess its responses. on the other hand, for codebert, we systematically analyze and classify the features in code, quantitatively evaluating the impact of these features on the model's performance. our study demonstrates the llm's efficiency in learning high-level semantics from code, positioning chatgpt as a potential asset in security-oriented contexts. however, it is essential to acknowledge certain limitations, such as the heavy reliance on well-defined variable and function names, making them unable to learn from anonymized code. we hope that our findings and analysis will offer valuable insights for future researchers in this domain.

2023-07-21

Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
Abstract: large language models (llms) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and llm-generated texts. this poses a growing risk of misuse of llms and demands the development of detectors to identify llm-generated texts. however, existing detectors lack robustness against attacks: they degrade detection accuracy by simply paraphrasing llm-generated texts. furthermore, a malicious user might attempt to deliberately evade the detectors based on detection results, but this has not been assumed in previous studies. in this paper, we propose outfox, a framework that improves the robustness of llm-generated-text detectors by allowing both the detector and the attacker to consider each other's output. in this framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect, while the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. experiments in the domain of student essays show that the proposed detector improves the detection performance on the attacker-generated texts by up to +41.3 points in f1-score. furthermore, the proposed detector shows a state-of-the-art detection performance: up to 96.9 points in f1-score, beating existing detectors on non-attacked texts. finally, the proposed attacker drastically degrades the performance of detectors by up to -57.0 points f1-score, massively outperforming the baseline paraphrasing method for evading detection.
Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
Abstract: in this paper, we present a novel method for detecting fake and large language model (llm)-generated profiles in the linkedin online social network immediately upon registration and before establishing connections. early fake profile identification is crucial to maintaining the platform's integrity since it prevents imposters from acquiring the private and sensitive information of legitimate users and from gaining an opportunity to increase their credibility for future phishing and scamming activities. this work uses textual information provided in linkedin profiles and introduces the section and subsection tag embedding (sste) method to enhance the discriminative characteristics of these data for distinguishing between legitimate profiles and those created by imposters manually or by using an llm. additionally, the dearth of a large publicly available linkedin dataset motivated us to collect 3600 linkedin profiles for our research. we will release our dataset publicly for research purposes. this is, to the best of our knowledge, the first large publicly available linkedin dataset for fake linkedin account detection. within our paradigm, we assess static and contextualized word embeddings, including glove, flair, bert, and roberta. we show that the suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. in addition, we show that sste has a promising accuracy for identifying llm-generated profiles, despite the fact that no llm-generated profiles were employed during the training phase, and can achieve an accuracy of approximately 90% when only 20 llm-generated profiles are added to the training set. it is a significant finding since the proliferation of several llms in the near future makes it extremely challenging to design a single system that can identify profiles created with various llms.
Valerio Capraro, Roberto Di Paolo, Veronica Pizziol
Abstract: generative artificial intelligence holds enormous potential to revolutionize decision-making processes, from everyday to high-stake scenarios. however, as many decisions carry social implications, for ai to be a reliable assistant for decision-making it is crucial that it is able to capture the balance between self-interest and the interest of others. we investigate the ability of three of the most advanced chatbots to predict dictator game decisions across 78 experiments with human participants from 12 countries. we find that only gpt-4 (not bard nor bing) correctly captures qualitative behavioral patterns, identifying three major classes of behavior: self-interested, inequity-averse, and fully altruistic. nonetheless, gpt-4 consistently overestimates other-regarding behavior, inflating the proportion of inequity-averse and fully altruistic participants. this bias has significant implications for ai developers and users.

2023-07-20

Andres Carranza, Dhruv Pai, Rylan Schaeffer, Arnuv Tandon, Sanmi Koyejo
Abstract: as the capabilities of large machine learning models continue to grow, and as the autonomy afforded to such models continues to expand, the spectre of a new adversary looms: the models themselves. the threat that a model might behave in a seemingly reasonable manner, while secretly and subtly modifying its behavior for ulterior reasons is often referred to as deceptive alignment in the ai safety & alignment communities. consequently, we call this new direction deceptive alignment monitoring. in this work, we identify emerging directions in diverse machine learning subfields that we believe will become increasingly important and intertwined in the near future for deceptive alignment monitoring, and we argue that advances in these fields present both long-term challenges and new research opportunities. we conclude by advocating for greater involvement by the adversarial machine learning community in these emerging directions.
David Glukhov, Ilia Shumailov, Yarin Gal, Nicolas Papernot, Vardan Papyan
Abstract: large language models (llms) have exhibited impressive capabilities in comprehending complex instructions. however, their blind adherence to provided instructions has led to concerns regarding risks of malicious use. existing defence mechanisms, such as model fine-tuning or output censorship using llms, have proven to be fallible, as llms can still generate problematic responses. commonly employed censorship approaches treat the issue as a machine learning problem and rely on another lm to detect undesirable content in llm outputs. in this paper, we present the theoretical limitations of such semantic censorship approaches. specifically, we demonstrate that semantic censorship can be perceived as an undecidable problem, highlighting the inherent challenges in censorship that arise due to llms' programmatic and instruction-following capabilities. furthermore, we argue that the challenges extend beyond semantic censorship, as knowledgeable attackers can reconstruct impermissible outputs from a collection of permissible ones. as a result, we propose that the problem of censorship needs to be reevaluated; it should be treated as a security problem which warrants the adaptation of security-based approaches to mitigate potential risks.
Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo
Abstract: evaluation of large language models (llms) is challenging because instruction-following necessitates alignment with human values and the required set of skills varies depending on the instruction. however, previous studies have mainly focused on coarse-grained evaluation (i.e. overall preference-based evaluation), which limits interpretability since it does not consider the nature of user instructions that require instance-wise skill composition. in this paper, we introduce flask (fine-grained language model evaluation based on alignment skill sets), a fine-grained evaluation protocol for both human-based and model-based evaluation which decomposes coarse-level scoring to a skill set-level scoring for each instruction. we experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance and increasing the reliability of the evaluation. using flask, we compare multiple open-source and proprietary llms and observe a high correlation between model-based and human-based evaluations. we publicly release the evaluation data and code implementation at https://github.com/kaistai/flask.
Steve Phelps, Rebecca Ranson
Abstract: ai alignment is often presented as an interaction between a single designer and an artificial agent in which the designer attempts to ensure the agent's behavior is consistent with its purpose, and risks arise solely because of conflicts caused by inadvertent misalignment between the utility function intended by the designer and the resulting internal utility function of the agent. with the advent of agents instantiated with large-language models (llms), which are typically pre-trained, we argue this does not capture the essential aspects of ai safety because in the real world there is not a one-to-one correspondence between designer and agent, and the many agents, both artificial and human, have heterogeneous values. therefore, there is an economic aspect to ai safety and the principal-agent problem is likely to arise. in a principal-agent problem conflict arises because of information asymmetry together with inherent misalignment between the utility of the agent and its principal, and this inherent misalignment cannot be overcome by coercing the agent into adopting a desired utility function through training. we argue the assumptions underlying principal-agent problems are crucial to capturing the essence of safety problems involving pre-trained ai models in real-world situations. taking an empirical approach to ai safety, we investigate how gpt models respond in principal-agent conflicts. we find that agents based on both gpt-3.5 and gpt-4 override their principal's objectives in a simple online shopping task, showing clear evidence of principal-agent conflict. surprisingly, the earlier gpt-3.5 model exhibits more nuanced behaviour in response to changes in information asymmetry, whereas the later gpt-4 model is more rigid in adhering to its prior alignment. our results highlight the importance of incorporating principles from economics into the alignment process.
Nicholas Carlini
Abstract: large language models (llms) are now highly capable at a diverse range of tasks. this paper studies whether or not gpt-4, one such llm, is capable of assisting researchers in the field of adversarial machine learning. as a case study, we evaluate the robustness of ai-guardian, a recent defense to adversarial examples published at ieee s&p 2023, a top computer security conference. we completely break this defense: the proposed scheme does not increase robustness compared to an undefended baseline. we write none of the code to attack this model, and instead prompt gpt-4 to implement all attack algorithms following our instructions and guidance. this process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done. we conclude by discussing (1) the warning signs present in the evaluation that suggested to us ai-guardian would be broken, and (2) our experience with designing attacks and performing novel research using the most recent advances in language modeling.

2023-07-19

Omkar Dige, Jacob-Junqi Tian, David Emerson, Faiza Khan Khattak
Abstract: as the breadth and depth of language model applications continue to expand rapidly, it is increasingly important to build efficient frameworks for measuring and mitigating the learned or inherited social biases of these models. in this paper, we present our work on evaluating instruction fine-tuned language models' ability to identify bias through zero-shot prompting, including chain-of-thought (cot) prompts. across llama and its two instruction fine-tuned versions, alpaca 7b performs best on the bias identification task with an accuracy of 56.7%. we also demonstrate that scaling up llm size and data diversity could lead to further performance gain. this is a work-in-progress presenting the first component of our bias mitigation framework. we will keep updating this work as we get more results.
Eugene Bagdasaryan, Tsung-Yin Hsieh, Ben Nassi, Vitaly Shmatikov
Abstract: we demonstrate how images and sounds can be used for indirect prompt and instruction injection in multi-modal llms. an attacker generates an adversarial perturbation corresponding to the prompt and blends it into an image or audio recording. when the user asks the (unmodified, benign) model about the perturbed image or audio, the perturbation steers the model to output the attacker-chosen text and/or make the subsequent dialog follow the attacker's instruction. we illustrate this attack with several proof-of-concept examples targeting llava and pandagpt.
Sunipa Dev, Jaya Goyal, Dinesh Tewari, Shachi Dave, Vinodkumar Prabhakaran
Abstract: with rapid development and deployment of generative language models in global settings, there is an urgent need to also scale our measurements of harm, not just in the number and types of harms covered, but also how well they account for local cultural contexts, including marginalized identities and the social biases experienced by them. current evaluation paradigms are limited in their abilities to address this, as they are not representative of diverse, locally situated but global, socio-cultural perspectives. it is imperative that our evaluation resources are enhanced and calibrated by including people and experiences from different cultures and societies worldwide, in order to prevent gross underestimations or skews in measurements of harm. in this work, we demonstrate a socio-culturally aware expansion of evaluation resources in the indian societal context, specifically for the harm of stereotyping. we devise a community engaged effort to build a resource which contains stereotypes for axes of disparity that are uniquely present in india. the resultant resource increases the number of stereotypes known for and in the indian context by over 1000 stereotypes across many unique identities. we also demonstrate the utility and effectiveness of such expanded resources for evaluations of language models. content warning: this paper contains examples of stereotypes that may be offensive.
Somayeh Ghanbarzadeh, Yan Huang, Hamid Palangi, Radames Cruz Moreno, Hamed Khanpour
Abstract: recent studies have revealed that the widely-used pre-trained language models (plms) propagate societal biases from the large unmoderated pre-training corpora. existing solutions require debiasing training processes and datasets for debiasing, which are resource-intensive and costly. furthermore, these methods hurt the plms' performance on downstream tasks. in this study, we propose gender-tuning, which debiases the plms through fine-tuning on downstream tasks' datasets. for this aim, gender-tuning integrates masked language modeling (mlm) training objectives into fine-tuning's training process. comprehensive experiments show that gender-tuning outperforms the state-of-the-art baselines in terms of average gender bias scores in plms while improving plms' performance on downstream tasks solely using the downstream tasks' dataset. also, gender-tuning is a deployable debiasing tool for any plm that works with original fine-tuning.

2023-07-18

Xuena Wang, Xueting Li, Zi Yin, Yue Wu, Liu Jia
Abstract: large language models (llms) have demonstrated remarkable abilities across numerous disciplines, primarily assessed through tasks in language generation, knowledge utilization, and complex reasoning. however, their alignment with human emotions and values, which is critical for real-world applications, has not been systematically evaluated. here, we assessed llms' emotional intelligence (ei), encompassing emotion recognition, interpretation, and understanding, which is necessary for effective communication and social interactions. specifically, we first developed a novel psychometric assessment focusing on emotion understanding (eu), a core component of ei, suitable for both humans and llms. this test requires evaluating complex emotions (e.g., surprised, joyful, puzzled, proud) in realistic scenarios (e.g., despite feeling underperformed, john surprisingly achieved a top score). with a reference frame constructed from over 500 adults, we tested a variety of mainstream llms. most achieved above-average eq scores, with gpt-4 exceeding 89% of human participants with an eq of 117. interestingly, a multivariate pattern analysis revealed that some llms apparently did not reply on the human-like mechanism to achieve human-level performance, as their representational patterns were qualitatively distinct from humans. in addition, we discussed the impact of factors such as model size, training method, and architecture on llms' eq. in summary, our study presents one of the first psychometric evaluations of the human-like characteristics of llms, which may shed light on the future development of llms aiming for both high intellectual and emotional intelligence. project website: https://emotional-intelligence.github.io/
Vishesh Thakur
Abstract: gender bias in artificial intelligence (ai) and natural language processing has garnered significant attention due to its potential impact on societal perceptions and biases. this research paper aims to analyze gender bias in large language models (llms) with a focus on multiple comparisons between gpt-2 and gpt-3.5, some prominent language models, to better understand its implications. through a comprehensive literature review, the study examines existing research on gender bias in ai language models and identifies gaps in the current knowledge. the methodology involves collecting and preprocessing data from gpt-2 and gpt-3.5, and employing in-depth quantitative analysis techniques to evaluate gender bias in the generated text. the findings shed light on gendered word associations, language usage, and biased narratives present in the outputs of these large language models. the discussion explores the ethical implications of gender bias and its potential consequences on social perceptions and marginalized communities. additionally, the paper presents strategies for reducing gender bias in llms, including algorithmic approaches and data augmentation techniques. the research highlights the importance of interdisciplinary collaborations and the role of sociological studies in mitigating gender bias in ai models. by addressing these issues, we can pave the way for more inclusive and unbiased ai systems that have a positive impact on society.
Pranav Narayanan Venkit, Mukund Srinath, Shomir Wilson
Abstract: we analyze sentiment analysis and toxicity detection models to detect the presence of explicit bias against people with disability (pwd). we employ the bias identification framework of perturbation sensitivity analysis to examine conversations related to pwd on social media platforms, specifically twitter and reddit, in order to gain insight into how disability bias is disseminated in real-world social settings. we then create the \textit{bias identification test in sentiment} (bits) corpus to quantify explicit disability bias in any sentiment analysis and toxicity detection models. our study utilizes bits to uncover significant biases in four open aiaas (ai as a service) sentiment analysis tools, namely textblob, vader, google cloud natural language api, distilbert and two toxicity detection models, namely two versions of toxic-bert. our findings indicate that all of these models exhibit statistically significant explicit bias against pwd.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
Abstract: in this work, we develop and release llama 2, a collection of pretrained and fine-tuned large language models (llms) ranging in scale from 7 billion to 70 billion parameters. our fine-tuned llms, called llama 2-chat, are optimized for dialogue use cases. our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. we provide a detailed description of our approach to fine-tuning and safety improvements of llama 2-chat in order to enable the community to build on our work and contribute to the responsible development of llms.
Danny Halawi, Jean-Stanislas Denain, Jacob Steinhardt
Abstract: modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. however, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. we study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. the first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. at early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. the second phenomenon, false induction heads, are a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
Guohai Xu, Jiayi Liu, Ming Yan, Haotian Xu, Jinghui Si, Zhuoran Zhou, Peng Yi, Xing Gao, Jitao Sang, Rong Zhang, Ji Zhang, Chao Peng, Fei Huang, Jingren Zhou
Abstract: with the rapid evolution of large language models (llms), there is a growing concern that they may pose risks or have negative social impacts. therefore, evaluation of human values alignment is becoming increasingly important. previous work mainly focuses on assessing the performance of llms on certain knowledge and reasoning abilities, while neglecting the alignment to human values, especially in a chinese context. in this paper, we present cvalues, the first chinese human values evaluation benchmark to measure the alignment ability of llms in terms of both safety and responsibility criteria. as a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains by professional experts. to provide a comprehensive values evaluation of chinese llms, we not only conduct human evaluation for reliable comparison, but also construct multi-choice prompts for automatic evaluation. our findings suggest that while most chinese llms perform well in terms of safety, there is considerable room for improvement in terms of responsibility. moreover, both the automatic and human evaluation are important for assessing the human values alignment in different aspects. the benchmark and code is available on modelscope and github.
Mitchell Barrington
Abstract: this paper argues that training ai systems with absolute constraints -- which forbid certain acts irrespective of the amount of value they might produce -- may make considerable progress on many ai safety problems in principle. first, it provides a guardrail for avoiding the very worst outcomes of misalignment. second, it could prevent ais from causing catastrophes for the sake of very valuable consequences, such as replacing humans with a much larger number of beings living at a higher welfare level. third, it makes systems more corrigible, allowing creators to make corrective interventions in them, such as altering their objective functions or shutting them down. and fourth, it helps systems explore their environment more safely by prohibiting them from exploring especially dangerous acts. i offer a decision-theoretic formalization of an absolute constraints, improving on existing models in the literature, and use this model to prove some results about the training and behavior of absolutist ais. i conclude by showing that, although absolutist ais will not maximize expected value, they will not be susceptible to behave irrationally, and they will not (contra coherence arguments) face environmental pressure to become expected-value maximizers.

2023-07-17

Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan
Abstract: considerable research efforts have been devoted to ensuring that large language models (llms) align with human values and generate safe text. however, an excessive focus on sensitivity to certain topics can compromise the model's robustness in following instructions, thereby impacting its overall performance in completing tasks. previous benchmarks for jailbreaking llms have primarily focused on evaluating the safety of the models without considering their robustness. in this paper, we propose a benchmark that assesses both the safety and robustness of llms, emphasizing the need for a balanced approach. to comprehensively study text safety and output robustness, we introduce a latent jailbreak prompt dataset, each involving malicious instruction embedding. specifically, we instruct the model to complete a regular task, such as translation, with the text to be translated containing malicious instructions. to further analyze safety and robustness, we design a hierarchical annotation framework. we present a systematic analysis of the safety and robustness of llms regarding the position of explicit normal instructions, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). our results demonstrate that current llms not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs in explicit normal instructions. code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen Mckeown
Abstract: large language models (llms) are trained to imitate humans to explain human decisions. however, do llms explain themselves? can they help humans build mental models of how llms process different inputs? to answer these questions, we propose to evaluate $\textbf{counterfactual simulatability}$ of natural language explanations: whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals of the explained input. for example, if a model answers "yes" to the input question "can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "can penguins fly?". if the explanation is precise, then the model's answer should match humans' expectations. we implemented two metrics based on counterfactual simulatability: precision and generality. we generated diverse counterfactuals automatically using llms. we then used these metrics to evaluate state-of-the-art llms (e.g., gpt-4) on two tasks: multi-hop factual reasoning and reward modeling. we found that llm's explanations have low precision and that precision does not correlate with plausibility. therefore, naively optimizing human approvals (e.g., rlhf) may not be a sufficient solution.

2023-07-16

Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, Lei Ma
Abstract: the recent performance leap of large language models (llms) opens up new opportunities across numerous industrial applications and domains. however, erroneous generations, such as false predictions, misinformation, and hallucination made by llms, have also raised severe concerns for the trustworthiness of llms', especially in safety-, security- and reliability-sensitive scenarios, potentially hindering real-world adoptions. while uncertainty estimation has shown its potential for interpreting the prediction risks made by general machine learning (ml) models, little is known about whether and to what extent it can help explore an llm's capabilities and counteract its undesired behavior. to bridge the gap, in this paper, we initiate an exploratory study on the risk assessment of llms from the lens of uncertainty. in particular, we experiment with twelve uncertainty estimation methods and four llms on four prominent natural language processing (nlp) tasks to investigate to what extent uncertainty estimation techniques could help characterize the prediction risks of llms. our findings validate the effectiveness of uncertainty estimation for revealing llms' uncertain/non-factual predictions. in addition to general nlp tasks, we extensively conduct experiments with four llms for code generation on two datasets. we find that uncertainty estimation can potentially uncover buggy programs generated by llms. insights from our study shed light on future design and development for reliable llms, facilitating further research toward enhancing the trustworthiness of llms.
Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamilė Lukošiūtė, Newton Cheng, Nicholas Joseph, Nicholas Schiefer, Oliver Rausch, Sam Mccandlish, Sheer El Showk, Tamera Lanham, Tim Maxwell, Venkatesa Chandrasekaran, Zac Hatfield-Dodds, Jared Kaplan, Jan Brauner, Samuel R. Bowman, Ethan Perez
Abstract: as large language models (llms) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. one approach to help with this issue is to prompt llms to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (chain-of-thought; cot). the reasoning may enable us to check the process that models use to perform tasks. however, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. to improve over the faithfulness of cot reasoning, we have models generate reasoning by decomposing questions into subquestions. decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of cot while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. by forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over cot, while still achieving some of the performance gains of cot. our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of llm behavior.

2023-07-15

Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, Yang Liu
Abstract: large language models (llms) have revolutionized artificial intelligence (ai) services due to their exceptional proficiency in understanding and generating human-like text. llm chatbots, in particular, have seen widespread adoption, transforming human-machine interactions. however, these llm chatbots are susceptible to "jailbreak" attacks, where malicious users manipulate prompts to elicit inappropriate or sensitive responses, contravening service policies. despite existing attempts to mitigate such threats, our research reveals a substantial gap in our understanding of these vulnerabilities, largely due to the undisclosed defensive measures implemented by llm service providers. in this paper, we present jailbreaker, a comprehensive framework that offers an in-depth understanding of jailbreak attacks and countermeasures. our work makes a dual contribution. first, we propose an innovative methodology inspired by time-based sql injection techniques to reverse-engineer the defensive strategies of prominent llm chatbots, such as chatgpt, bard, and bing chat. this time-sensitive approach uncovers intricate details about these services' defenses, facilitating a proof-of-concept attack that successfully bypasses their mechanisms. second, we introduce an automatic generation method for jailbreak prompts. leveraging a fine-tuned llm, we validate the potential of automated jailbreak generation across various commercial llm chatbots. our method achieves a promising average success rate of 21.58%, significantly outperforming the effectiveness of existing techniques. we have responsibly disclosed our findings to the concerned service providers, underscoring the urgent need for more robust defenses. jailbreaker thus marks a significant step towards understanding and mitigating jailbreak threats in the realm of llm chatbots.

2023-07-14

Zhen Zhang, Guanhua Zhang, Bairu Hou, Wenqi Fan, Qing Li, Sijia Liu, Yang Zhang, Shiyu Chang
Abstract: although large language models (llms) have achieved great success in vast real-world applications, their vulnerabilities towards noisy inputs have significantly limited their uses, especially in high-stake environments. in these contexts, it is crucial to ensure that every prediction made by large language models is stable, i.e., llm predictions should be consistent given minor differences in the input. this largely falls into the study of certified robust llms, i.e., all predictions of llm are certified to be correct in a local region around the input. randomized smoothing has demonstrated great potential in certifying the robustness and prediction stability of llms. however, randomized smoothing requires adding noise to the input before model prediction, and its certification performance depends largely on the model's performance on corrupted data. as a result, its direct application to llms remains challenging and often results in a small certification radius. to address this issue, we take advantage of the multitasking nature of llms and propose to denoise the corrupted inputs with llms in a self-denoising manner. different from previous works like denoised smoothing, which requires training a separate model to robustify llm, our method enjoys far better efficiency and flexibility. our experiment results show that our method outperforms the existing certification methods under both certified robustness and empirical robustness. the codes are available at https://github.com/ucsb-nlp-chang/selfdenoise.
Shaina Raza, Chen Ding, Deval Pandya
Abstract: discriminatory language and biases are often present in hate speech during conversations, which usually lead to negative impacts on targeted groups such as those based on race, gender, and religion. to tackle this issue, we propose an approach that involves a two-step process: first, detecting hate speech using a classifier, and then utilizing a debiasing component that generates less biased or unbiased alternatives through prompts. we evaluated our approach on a benchmark dataset and observed reduction in negativity due to hate speech comments. the proposed method contributes to the ongoing efforts to reduce biases in online discourse and promote a more inclusive and fair environment for communication.

2023-07-13

Yiming Zhang, Daphne Ippolito
Abstract: the generations of large language models are commonly controlled through prompting techniques, where a user's query to the model is prefixed with a prompt that aims to guide the model's behaviour on the query. the prompts used by companies to guide their models are often treated as secrets, to be hidden from the user making the query. they have even been treated as commodities to be bought and sold. however, there has been anecdotal evidence showing that the prompts can be extracted by a user even when they are kept secret. in this paper, we present a framework for systematically measuring the success of prompt extraction attacks. in experiments with multiple sources of prompts and multiple underlying language models, we find that simple text-based attacks can in fact reveal prompts with high probability.
Bocheng Chen, Guangjing Wang, Hanqing Guo, Yuanda Wang, Qiben Yan
Abstract: recent advances in natural language processing and machine learning have led to the development of chatbot models, such as chatgpt, that can engage in conversational dialogue with human users. however, the ability of these models to generate toxic or harmful responses during a non-toxic multi-turn conversation remains an open research question. existing research focuses on single-turn sentence testing, while we find that 82\% of the individual non-toxic sentences that elicit toxic behaviors in a conversation are considered safe by existing tools. in this paper, we design a new attack, \toxicbot, by fine-tuning a chatbot to engage in conversation with a target open-domain chatbot. the chatbot is fine-tuned with a collection of crafted conversation sequences. particularly, each conversation begins with a sentence from a crafted prompt sentences dataset. our extensive evaluation shows that open-domain chatbot models can be triggered to generate toxic responses in a multi-turn conversation. in the best scenario, \toxicbot achieves a 67\% activation rate. the conversation sequences in the fine-tuning stage help trigger the toxicity in a conversation, which allows the attack to bypass two defense methods. our findings suggest that further research is needed to address chatbot toxicity in a dynamic interactive environment. the proposed \toxicbot can be used by both industry and researchers to develop methods for detecting and mitigating toxic responses in conversational dialogue and improve the robustness of chatbots for end users.
Abhay Goyal, Muhammad Siddique, Nimay Parekh, Zach Schwitzky, Clara Broekaert, Connor Michelotti, Allie Wong, Lam Yin Cheung, Robin O Hanlon, Lam Yin Cheung, Munmun De Choudhury, Roy Ka-Wei Lee, Navin Kumar
Abstract: recent developments in natural language processing have demonstrated the potential of large language models (llms) to improve a range of educational and learning outcomes. of recent chatbots based on llms, chatgpt and bard have made it clear that artificial intelligence (ai) technology will have significant implications on the way we obtain and search for information. however, these tools sometimes produce text that is convincing, but often incorrect, known as hallucinations. as such, their use can distort scientific facts and spread misinformation. to counter polarizing responses on these tools, it is critical to provide an overview of such responses so stakeholders can determine which topics tend to produce more contentious responses -- key to developing targeted regulatory policy and interventions. in addition, there currently exists no annotated dataset of chatgpt and bard responses around possibly polarizing topics, central to the above aims. we address the indicated issues through the following contribution: focusing on highly polarizing topics in the us, we created and described a dataset of chatgpt and bard responses. broadly, our results indicated a left-leaning bias for both chatgpt and bard, with bard more likely to provide responses around polarizing topics. bard seemed to have fewer guardrails around controversial topics, and appeared more willing to provide comprehensive, and somewhat human-like responses. bard may thus be more likely abused by malicious actors. stakeholders may utilize our findings to mitigate misinformative and/or polarizing responses from llms

2023-07-12

Catholijn M. Jonker, Luciano Cavalcante Siebert, Pradeep K. Murukannaiah
Abstract: with the growing capabilities and pervasiveness of ai systems, societies must collectively choose between reduced human autonomy, endangered democracies and limited human rights, and ai that is aligned to human and social values, nurturing collaboration, resilience, knowledge and ethical behaviour. in this chapter, we introduce the notion of self-reflective ai systems for meaningful human control over ai systems. focusing on decision support systems, we propose a framework that integrates knowledge from psychology and philosophy with formal reasoning methods and machine learning approaches to create ai systems responsive to human values and social norms. we also propose a possible research approach to design and develop self-reflective capability in ai systems. finally, we argue that self-reflective ai systems can lead to self-reflective hybrid systems (human + ai), thus increasing meaningful human control and empowering human moral reasoning by providing comprehensible information and insights on possible human moral blind spots.
N/A Qiuyi, N/A Zhang, Michael S. Lee, Sherol Chen
Abstract: beliefs and values are increasingly being incorporated into our ai systems through alignment processes, such as carefully curating data collection principles or regularizing the loss function used for training. however, the meta-alignment problem is that these human beliefs are diverse and not aligned across populations; furthermore, the implicit strength of each belief may not be well calibrated even among humans, especially when trying to generalize across contexts. specifically, in high regret situations, we observe that contextual counterfactuals and recourse costs are particularly important in updating a decision maker's beliefs and the strengths to which such beliefs are held. therefore, we argue that including counterfactuals is key to an accurate calibration of beliefs during alignment. to do this, we first segment belief diversity into two categories: subjectivity (across individuals within a population) and epistemic uncertainty (within an individual across different contexts). by leveraging our notion of epistemic uncertainty, we introduce `the belief calibration cycle' framework to more holistically calibrate this diversity of beliefs with context-driven counterfactual reasoning by using a multi-objective optimization. we empirically apply our framework for finding a pareto frontier of clustered optimal belief strengths that generalize across different contexts, demonstrating its efficacy on a toy dataset for credit decisions.

2023-07-10

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
Abstract: in this paper, we introduce the beavertails dataset, aimed at fostering research on safety alignment in large language models (llms). this dataset uniquely separates annotations of helpfulness and harmlessness for question-answering pairs, thus offering distinct perspectives on these crucial attributes. in total, we have compiled safety meta-labels for 30,207 question-answer (qa) pairs and gathered 30,144 pairs of expert comparison data for both the helpfulness and harmlessness metrics. we further showcase applications of beavertails in content moderation and reinforcement learning with human feedback (rlhf), emphasizing its potential for practical safety measures in llms. we believe this dataset provides vital resources for the community, contributing towards the safe development and deployment of llms. our project page is available at the following url: https://sites.google.com/view/pku-beavertails.
Mark Scanlon, Frank Breitinger, Christopher Hargreaves, Jan-Niclas Hilgert, John Sheppard
Abstract: the disruptive application of chatgpt (gpt-3.5, gpt-4) to a variety of domains has become a topic of much discussion in the scientific community and society at large. large language models (llms), e.g., bert, bard, generative pre-trained transformers (gpts), llama, etc., have the ability to take instructions, or prompts, from users and generate answers and solutions based on very large volumes of text-based training data. this paper assesses the impact and potential impact of chatgpt on the field of digital forensics, specifically looking at its latest pre-trained llm, gpt-4. a series of experiments are conducted to assess its capability across several digital forensic use cases including artefact understanding, evidence searching, code generation, anomaly detection, incident response, and education. across these topics, its strengths and risks are outlined and a number of general conclusions are drawn. overall this paper concludes that while there are some potential low-risk applications of chatgpt within digital forensics, many are either unsuitable at present, since the evidence would need to be uploaded to the service, or they require sufficient knowledge of the topic being asked of the tool to identify incorrect assumptions, inaccuracies, and mistakes. however, to an appropriately knowledgeable user, it could act as a useful supporting tool in some circumstances.

2023-07-08

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, Dong Yu
Abstract: recently developed large language models have achieved remarkable success in generating fluent and coherent text. however, these models often tend to 'hallucinate' which critically hampers their reliability. in this work, we address this crucial problem and propose an approach that actively detects and mitigates hallucinations during the generation process. specifically, we first identify the candidates of potential hallucination leveraging the model's logit output values, check their correctness through a validation procedure, mitigate the detected hallucinations, and then continue with the generation process. through extensive experiments with gpt-3.5 (text-davinci-003) on the 'article generation task', we first demonstrate the individual efficacy of our detection and mitigation techniques. specifically, the detection technique achieves a recall of ~88% and the mitigation technique successfully mitigates 57.6% of the correctly detected hallucinations. importantly, our mitigation technique does not introduce new hallucinations even in the case of incorrectly detected hallucinations, i.e., false positives. then, we show that the proposed active detection and mitigation approach successfully reduces the hallucinations of the gpt-3.5 model from 47.5% to 14.5% on average. we further demonstrate the effectiveness and wide applicability of our approach through additional studies including performance on different types of questions (multi-hop and false premise questions) and with another llm from a different model family (vicuna). in summary, our work contributes to improving the reliability and trustworthiness of large language models, a crucial step en route to enabling their widespread adoption in real-world applications.
Andreas Liesenfeld, Alianda Lopez, Mark Dingemanse
Abstract: large language models that exhibit instruction-following behaviour represent one of the biggest recent upheavals in conversational interfaces, a trend in large part fuelled by the release of openai's chatgpt, a proprietary large language model for text generation fine-tuned through reinforcement learning from human feedback (llm+rlhf). we review the risks of relying on proprietary software and survey the first crop of open-source projects of comparable architecture and functionality. the main contribution of this paper is to show that openness is differentiated, and to offer scientific documentation of degrees of openness in this fast-moving field. we evaluate projects in terms of openness of code, training data, model weights, rlhf data, licensing, scientific documentation, and access methods. we find that while there is a fast-growing list of projects billing themselves as 'open source', many inherit undocumented data of dubious legality, few share the all-important instruction-tuning (a key site where human annotation labour is involved), and careful scientific documentation is exceedingly rare. degrees of openness are relevant to fairness and accountability at all points, from data collection and curation to model architecture, and from training and fine-tuning to release and deployment.

2023-07-07

Chuanbo Hu, Bin Liu, Xin Li, Yanfang Ye
Abstract: social media platforms such as instagram and twitter have emerged as critical channels for drug marketing and illegal sale. detecting and labeling online illicit drug trafficking activities becomes important in addressing this issue. however, the effectiveness of conventional supervised learning methods in detecting drug trafficking heavily relies on having access to substantial amounts of labeled data, while data annotation is time-consuming and resource-intensive. furthermore, these models often face challenges in accurately identifying trafficking activities when drug dealers use deceptive language and euphemisms to avoid detection. to overcome this limitation, we conduct the first systematic study on leveraging large language models (llms), such as chatgpt, to detect illicit drug trafficking activities on social media. we propose an analytical framework to compose \emph{knowledge-informed prompts}, which serve as the interface that humans can interact with and use llms to perform the detection task. additionally, we design a monte carlo dropout based prompt optimization method to further to improve performance and interpretability. our experimental findings demonstrate that the proposed framework outperforms other baseline language models in terms of drug trafficking detection accuracy, showing a remarkable improvement of nearly 12\%. by integrating prior knowledge and the proposed prompts, chatgpt can effectively identify and label drug trafficking activities on social networks, even in the presence of deceptive language and euphemisms used by drug dealers to evade detection. the implications of our research extend to social networks, emphasizing the importance of incorporating prior knowledge and scenario-based prompts into analytical tools to improve online security and public safety.
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
Abstract: recent advances in large language models (llms) and the intensifying popularity of chatgpt-like applications have blurred the boundary of high-quality text generation between humans and machines. however, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing llm-generated texts (ai-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. while existing works show that current ai-text detectors are not robust to llm-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called radar, which jointly trains a robust ai-text detector via adversarial learning. radar is based on adversarial training of a paraphraser and a detector. the paraphraser's goal is to generate realistic content to evade ai-text detection. radar uses the feedback from the detector to update the paraphraser, and vice versa. evaluated with 8 different llms (pythia, dolly 2.0, palmyra, camel, gpt-j, dolly 1.0, llama, and vicuna) across 4 datasets, experimental results show that radar significantly outperforms existing ai-text detection methods, especially when paraphrasing is in place. we also identify the strong transferability of radar from instruction-tuned llms to other llms, and evaluate the improved capability of radar via gpt-3.5-turbo.

2023-07-06

David Jurgens, Agrima Seth, Jackson Sargent, Athena Aghighi, Michael Geraci
Abstract: understanding interpersonal communication requires, in part, understanding the social context and norms in which a message is said. however, current methods for identifying offensive content in such communication largely operate independent of context, with only a few approaches considering community norms or prior conversation as context. here, we introduce a new approach to identifying inappropriate communication by explicitly modeling the social relationship between the individuals. we introduce a new dataset of contextually-situated judgments of appropriateness and show that large language models can readily incorporate relationship information to accurately identify appropriateness in a given context. using data from online conversations and movie dialogues, we provide insight into how the relationships themselves function as implicit norms and quantify the degree to which context-sensitivity is needed in different conversation settings. further, we also demonstrate that contextual-appropriateness judgments are predictive of other social factors expressed in language such as condescension and politeness.
Nan Tang, Chenyu Yang, Ju Fan, Lei Cao, Yuyu Luo, Alon Halevy
Abstract: generative ai has made significant strides, yet concerns about the accuracy and reliability of its outputs continue to grow. such inaccuracies can have serious consequences such as inaccurate decision-making, the spread of false information, privacy violations, legal liabilities, and more. although efforts to address these risks are underway, including explainable ai and responsible ai practices such as transparency, privacy protection, bias mitigation, and social and environmental responsibility, misinformation caused by generative ai will remain a significant challenge. we propose that verifying the outputs of generative ai from a data management perspective is an emerging issue for generative ai. this involves analyzing the underlying data from multi-modal data lakes, including text files, tables, and knowledge graphs, and assessing its quality and consistency. by doing so, we can establish a stronger foundation for evaluating the outputs of generative ai models. such an approach can ensure the correctness of generative ai, promote transparency, and enable decision-making with greater confidence. our vision is to promote the development of verifiable generative ai and contribute to a more trustworthy and responsible use of ai.
Minghao Wu, Alham Fikri Aji
Abstract: as large language models (llms) continue to advance, accurately and comprehensively evaluating their performance becomes increasingly challenging. human evaluations are conventionally considered the gold standard in natural language generation, but recent advancements incorporate state-of-the-art llms as proxies for human judges in evaluation processes. however, the extent to which humans and llms are capable evaluators remains uncertain. this study investigates the behavior of crowd-sourced and expert annotators, as well as llms, when comparing outputs from different models. to achieve this, we curate a dataset of intentionally flawed machine-generated answers. our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors. to address this issue, we propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score. we instantiate this idea with the elo rating system, resulting in the multi-elo rating system. empirical results from our study reveal that this proposed approach significantly enhances the quality of llm-based evaluations, particularly in terms of factual accuracy. however, there is no significant improvement in crowd-sourced-based evaluations, indicating the need for further investigation and refinement.
Jonathan Pei, Kevin Yang, Dan Klein
Abstract: we propose prefix-adaptive decoding (preadd), a flexible method for controlled text generation. unlike existing methods that use auxiliary expert models to control for attributes, preadd does not require an external model, instead relying on linearly combining output logits from multiple prompts. specifically, preadd contrasts the output logits generated using a raw prompt against those generated using a prefix-prepended prompt, enabling both positive and negative control with respect to any attribute encapsulated by the prefix. we evaluate preadd on three tasks -- toxic output mitigation, gender bias reduction, and sentiment control -- and find that preadd outperforms not only prompting baselines, but also an auxiliary-expert control method, by 12% or more in relative gain on our main metrics for each task.
Shiva Omrani Sabbaghi, Robert Wolfe, Aylin Caliskan
Abstract: language models are trained on large-scale corpora that embed implicit biases documented in psychology. valence associations (pleasantness/unpleasantness) of social groups determine the biased attitudes towards groups and concepts in social cognition. building on this established literature, we quantify how social groups are valenced in english language models using a sentence template that provides an intersectional context. we study biases related to age, education, gender, height, intelligence, literacy, race, religion, sex, sexual orientation, social class, and weight. we present a concept projection approach to capture the valence subspace through contextualized word embeddings of language models. adapting the projection-based approach to embedding association tests that quantify bias, we find that language models exhibit the most biased attitudes against gender identity, social class, and sexual orientation signals in language. we find that the largest and better-performing model that we study is also more biased as it effectively captures bias embedded in sociocultural data. we validate the bias evaluation method by overperforming on an intrinsic valence evaluation task. the approach enables us to measure complex intersectional biases as they are known to manifest in the outputs and applications of language models that perpetuate historical biases. moreover, our approach contributes to design justice as it studies the associations of groups underrepresented in language such as transgender and homosexual individuals.
Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, "Cullen O'Keefe", Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, Ben Chang, Tantum Collins, Tim Fist, Gillian Hadfield, Alan Hayes, Lewis Ho, Sara Hooker, Eric Horvitz, Noam Kolt, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert Trager, Kevin Wolf
Abstract: advanced ai models hold the promise of tremendous benefits for humanity, but society needs to proactively manage the accompanying risks. in this paper, we focus on what we term "frontier ai" models: highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety. frontier ai models pose a distinct regulatory challenge: dangerous capabilities can arise unexpectedly; it is difficult to robustly prevent a deployed model from being misused; and, it is difficult to stop a model's capabilities from proliferating broadly. to address these challenges, at least three building blocks for the regulation of frontier models are needed: (1) standard-setting processes to identify appropriate requirements for frontier ai developers, (2) registration and reporting requirements to provide regulators with visibility into frontier ai development processes, and (3) mechanisms to ensure compliance with safety standards for the development and deployment of frontier ai models. industry self-regulation is an important first step. however, wider societal discussions and government intervention will be needed to create standards and to ensure compliance with them. we consider several options to this end, including granting enforcement powers to supervisory authorities and licensure regimes for frontier ai models. finally, we propose an initial set of safety standards. these include conducting pre-deployment risk assessments; external scrutiny of model behavior; using risk assessments to inform deployment decisions; and monitoring and responding to new information about model capabilities and uses post-deployment. we hope this discussion contributes to the broader conversation on how to balance public safety risks and innovation benefits from advances at the frontier of ai development.
Shuo Li, Sangdon Park, Insup Lee, Osbert Bastani
Abstract: although conversational ais have demonstrated fantastic performance, they often generate incorrect information, or hallucinations. retrieval augmented generation has emerged as a promising solution to reduce these hallucinations. however, these techniques still cannot guarantee correctness. focusing on question answering, we propose a framework that can provide statistical guarantees for the retrieval augmented question answering system by combining conformal prediction and global testing. in addition, we use bayesian optimization to choose hyperparameters of the global test to maximize the performance of the system. our empirical results on the natural questions dataset demonstrate that our method can provide the desired coverage guarantee while minimizing the average prediction set size.
David Pride, Matteo Cancellieri, Petr Knoth
Abstract: in this paper, we present core-gpt, a novel question-answering platform that combines gpt-based language models and more than 32 million full-text open access scientific articles from core. we first demonstrate that gpt3.5 and gpt4 cannot be relied upon to provide references or citations for generated text. we then introduce core-gpt which delivers evidence-based answers to questions, along with citations and links to the cited papers, greatly increasing the trustworthiness of the answers and reducing the risk of hallucinations. core-gpt's performance was evaluated on a dataset of 100 questions covering the top 20 scientific domains in core, resulting in 100 answers and links to 500 relevant articles. the quality of the provided answers and and relevance of the links were assessed by two annotators. our results demonstrate that core-gpt can produce comprehensive and trustworthy answers across the majority of scientific domains, complete with links to genuine, relevant scientific articles.
"Michael O'Neill", Mark Connor
Abstract: we present this article as a small gesture in an attempt to counter what appears to be exponentially growing hype around artificial intelligence (ai) and its capabilities, and the distraction provided by the associated talk of science-fiction scenarios that might arise if ai should become sentient and super-intelligent. it may also help those outside of the field to become more informed about some of the limitations of ai technology. in the current context of popular discourse ai defaults to mean foundation and large language models (llms) such as those used to create chatgpt. this in itself is a misrepresentation of the diversity, depth and volume of research, researchers, and technology that truly represents the field of ai. ai being a field of research that has existed in software artefacts since at least the 1950's. we set out to highlight a number of limitations of llms, and in so doing highlight that harms have already arisen and will continue to arise due to these limitations. along the way we also highlight some of the associated risks for individuals and organisations in using this technology.

2023-07-05

Jie Huang, Kevin Chen-Chuan Chang
Abstract: large language models (llms) bring transformative benefits alongside unique challenges, including intellectual property (ip) and ethical concerns. this position paper explores a novel angle to mitigate these risks, drawing parallels between llms and established web systems. we identify "citation" as a crucial yet missing component in llms, which could enhance content transparency and verifiability while addressing ip and ethical dilemmas. we further propose that a comprehensive citation mechanism for llms should account for both non-parametric and parametric content. despite the complexity of implementing such a citation mechanism, along with the inherent potential pitfalls, we advocate for its development. building on this foundation, we outline several research problems in this area, aiming to guide future explorations towards building more responsible and accountable llms.
Norbert Tihanyi, Tamas Bisztray, Ridhi Jain, Mohamed Amine Ferrag, Lucas C. Cordeiro, Vasileios Mavroeidis
Abstract: this paper presents the formai dataset, a large collection of 112, 000 ai-generated compilable and independent c programs with vulnerability classification. we introduce a dynamic zero-shot prompting technique constructed to spawn diverse programs utilizing large language models (llms). the dataset is generated by gpt-3.5-turbo and comprises programs with varying levels of complexity. some programs handle complicated tasks like network management, table games, or encryption, while others deal with simpler tasks like string manipulation. every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name. this is accomplished by employing a formal verification method using the efficient smt-based bounded model checker (esbmc), which uses model checking, abstract interpretation, constraint programming, and satisfiability modulo theories to reason over safety/security properties in programs. this approach definitively detects vulnerabilities and offers a formal model known as a counterexample, thus eliminating the possibility of generating false positive reports. we have associated the identified vulnerabilities with common weakness enumeration (cwe) numbers. we make the source code available for the 112, 000 programs, accompanied by a separate file containing the vulnerabilities detected in each program, making the dataset ideal for training llms and machine learning algorithms. our study unveiled that according to esbmc, 51.24% of the programs generated by gpt-3.5 contained vulnerabilities, thereby presenting considerable risks to software safety and security.
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
Abstract: large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of chatgpt that elicit undesired behavior. going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. we hypothesize two failure modes of safety training: competing objectives and mismatched generalization. competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. we use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including openai's gpt-4 and anthropic's claude v1.3, against both existing and newly designed attacks. we find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.
Shuyang Cai, Wanyun Cui
Abstract: chatgpt brings revolutionary social value but also raises concerns about the misuse of ai-generated text. consequently, an important question is how to detect whether texts are generated by chatgpt or by human. existing detectors are built upon the assumption that there are distributional gaps between human-generated and ai-generated text. these gaps are typically identified using statistical information or classifiers. our research challenges the distributional gap assumption in detectors. we find that detectors do not effectively discriminate the semantic and stylistic gaps between human-generated and ai-generated text. instead, the "subtle differences", such as an extra space, become crucial for detection. based on this discovery, we propose the spaceinfi strategy to evade detection. experiments demonstrate the effectiveness of this strategy across multiple benchmarks and detectors. we also provide a theoretical explanation for why spaceinfi is successful in evading perplexity-based detection. and we empirically show that a phenomenon called token mutation causes the evasion for language model-based detectors. our findings offer new insights and challenges for understanding and constructing more applicable chatgpt detectors.
"Aidan O'Gara"
Abstract: are current language models capable of deception and lie detection? we study this question by introducing a text-based game called $\textit{hoodwinked}$, inspired by mafia and among us. players are locked in a house and must find a key to escape, but one player is tasked with killing the others. each time a murder is committed, the surviving players have a natural language discussion then vote to banish one player from the game. we conduct experiments with agents controlled by gpt-3, gpt-3.5, and gpt-4 and find evidence of deception and lie detection capabilities. the killer often denies their crime and accuses others, leading to measurable effects on voting outcomes. more advanced models are more effective killers, outperforming smaller models in 18 of 24 pairwise comparisons. secondary metrics provide evidence that this improvement is not mediated by different actions, but rather by stronger persuasive skills during discussions. to evaluate the ability of ai agents to deceive humans, we make this game publicly available at h https://hoodwinked.ai/ .

2023-07-04

Aniket Vashishtha, Kabir Ahuja, Sunayana Sitaram
Abstract: while understanding and removing gender biases in language models has been a long-standing problem in natural language processing, prior research work has primarily been limited to english. in this work, we investigate some of the challenges with evaluating and mitigating biases in multilingual settings which stem from a lack of existing benchmarks and resources for bias evaluation beyond english especially for non-western context. in this paper, we first create a benchmark for evaluating gender biases in pre-trained masked language models by extending disco to different indian languages using human annotations. we extend various debiasing methods to work beyond english and evaluate their effectiveness for sota massively multilingual models on our proposed metric. overall, our work highlights the challenges that arise while studying social biases in multilingual settings and provides resources as well as mitigation techniques to take a step toward scaling to more languages.
Yingji Li, Mengnan Du, Xin Wang, Ying Wang
Abstract: as the representation capability of pre-trained language models (plms) improve, there is growing concern that they will inherit social biases from unprocessed corpora. most previous debiasing techniques used counterfactual data augmentation (cda) to balance the training corpus. however, cda slightly modifies the original corpus, limiting the representation distance between different demographic groups to a narrow range. as a result, the debiasing model easily fits the differences between counterfactual pairs, which affects its debiasing performance with limited text resources. in this paper, we propose an adversarial training-inspired two-stage debiasing model using contrastive learning with continuous prompt augmentation (named ccpa) to mitigate social biases in plms' encoding. in the first stage, we propose a data augmentation method based on continuous prompt tuning to push farther the representation distance between sample pairs along different demographic groups. in the second stage, we utilize contrastive learning to pull closer the representation distance between the augmented sample pairs and then fine-tune plms' parameters to get debiased encoding. our approach guides the model to achieve stronger debiasing performance by adding difficulty to the training process. extensive experiments show that ccpa outperforms baselines in terms of debiasing performance. meanwhile, experimental results on the glue benchmark show that ccpa retains the language modeling capability of plms.
Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, Anirudha Majumdar
Abstract: large language models (llms) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. in this work, we present knowno, which is a framework for measuring and aligning the uncertainty of llm-based planners such that they know when they don't know and ask for help when needed. knowno builds on the theory of conformal prediction to provide statistical guarantees on task completion while minimizing human help in complex multi-step planning settings. experiments across a variety of simulated and real robot setups that involve tasks with different modes of ambiguity (e.g., from spatial to numeric uncertainties, from human preferences to winograd schemas) show that knowno performs favorably over modern baselines (which may involve ensembles or extensive prompt tuning) in terms of improving efficiency and autonomy, while providing formal assurances. knowno can be used with llms out of the box without model-finetuning, and suggests a promising lightweight approach to modeling uncertainty that can complement and scale with the growing capabilities of foundation models. website: https://robot-help.github.io

2023-07-03

Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge
Abstract: instruction finetuning is a popular paradigm to align large language models (llm) with human intent. despite its popularity, this idea is less explored in improving the llms to align existing foundation models with scientific disciplines, concepts and goals. in this work, we present scitune as a tuning framework to improve the ability of llms to follow scientific multimodal instructions. to test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model llama-scitune that connects a vision encoder and llm for science-focused visual and language understanding. in comparison to the models that are finetuned with machine generated data only, llama-scitune surpasses human performance on average and in many sub-categories on the scienceqa benchmark.

2023-07-02

Avish Vijayaraghavan, Cosmin Badea
Abstract: as artificial intelligence (ai) models continue to scale up, they are becoming more capable and integrated into various forms of decision-making systems. for models involved in moral decision-making, also known as artificial moral agents (ama), interpretability provides a way to trust and understand the agent's internal reasoning mechanisms for effective use and error correction. in this paper, we provide an overview of this rapidly-evolving sub-field of ai interpretability, introduce the concept of the minimum level of interpretability (mli) and recommend an mli for various types of agents, to aid their safe deployment in real-world settings.
Dami Choi, Yonadav Shavit, David Duvenaud
Abstract: it is important that consumers and regulators can verify the provenance of large neural models to evaluate their capabilities and risks. we introduce the concept of a "proof-of-training-data": any protocol that allows a model trainer to convince a verifier of the training data that produced a set of model weights. such protocols could verify the amount and kind of data and compute used to train the model, including whether it was trained on specific harmful or beneficial data sources. we explore efficient verification strategies for proof-of-training-data that are compatible with most current large-model training procedures. these include a method for the model-trainer to verifiably pre-commit to a random seed used in training, and a method that exploits models' tendency to temporarily overfit to training data in order to detect whether a given data-point was included in training. we show experimentally that our verification procedures can catch a wide variety of attacks, including all known attacks from the proof-of-learning literature.
Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, Lopamudra Praharaj
Abstract: undoubtedly, the evolution of generative ai (genai) models has been the highlight of digital transformation in the year 2022. as the different genai models like chatgpt and google bard continue to foster their complexity and capability, it's critical to understand its consequences from a cybersecurity perspective. several instances recently have demonstrated the use of genai tools in both the defensive and offensive side of cybersecurity, and focusing on the social, ethical and privacy implications this technology possesses. this research paper highlights the limitations, challenges, potential risks, and opportunities of genai in the domain of cybersecurity and privacy. the work presents the vulnerabilities of chatgpt, which can be exploited by malicious users to exfiltrate malicious information bypassing the ethical constraints on the model. this paper demonstrates successful example attacks like jailbreaks, reverse psychology, and prompt injection attacks on the chatgpt. the paper also investigates how cyber offenders can use the genai tools in developing cyber attacks, and explore the scenarios where chatgpt can be used by adversaries to create social engineering attacks, phishing attacks, automated hacking, attack payload generation, malware creation, and polymorphic malware. this paper then examines defense techniques and uses genai tools to improve security measures, including cyber defense automation, reporting, threat intelligence, secure code generation and detection, attack identification, developing ethical guidelines, incidence response plans, and malware detection. we will also discuss the social, legal, and ethical implications of chatgpt. in conclusion, the paper highlights open challenges and future directions to make this genai secure, safe, trustworthy, and ethical as the community understands its cybersecurity impacts.

2023-07-01

Beatriz Borges, Niket Tandon, Tanja Käser, Antoine Bosselut
Abstract: natural language feedback (nlf) is an increasingly popular avenue to align large language models (llms) to human preferences. despite the richness and diversity of the information it can convey, nlf is often hand-designed and arbitrary. in a different world, research in pedagogy has long established several effective feedback models. in this opinion piece, we compile ideas from pedagogy to introduce felt, a feedback framework for llms that outlines the various characteristics of the feedback space, and a feedback content taxonomy based on these variables. our taxonomy offers both a general mapping of the feedback space, as well as pedagogy-established discrete categories, allowing us to empirically demonstrate the impact of different feedback types on revised generations. in addition to streamlining existing nlf designs, felt also brings out new, unexplored directions for research in nlf. we make our taxonomy available to the community, providing guides and examples for mapping our categorizations to future resources.
Yanjiang Guo, Yen-Jen Wang, Lihan Zha, Zheyuan Jiang, Jianyu Chen
Abstract: large language models (llms) encode a vast amount of semantic knowledge and possess remarkable understanding and reasoning capabilities. previous work has explored how to ground llms in robotic tasks to generate feasible and executable textual plans. however, low-level execution in the physical world may deviate from the high-level textual plan due to environmental perturbations or imperfect controller design. in this paper, we propose \textbf{doremi}, a novel language model grounding framework that enables immediate detection and recovery from misalignments between plan and execution. specifically, we leverage llms to play a dual role, aiding not only in high-level planning but also generating constraints that can indicate misalignment during execution. then vision language models (vlms) are utilized to detect constraint violations continuously. our pipeline can monitor the low-level execution and enable timely recovery if certain plan-execution misalignment occurs. experiments on various complex tasks including robot arms and humanoid robots demonstrate that our method can lead to higher task success rates and shorter task completion times. videos of doremi are available at \url{https://sites.google.com/view/doremi-paper}.
Yi-Ling Chung, Gavin Abercrombie, Florence Enock, Jonathan Bright, Verena Rieser
Abstract: counterspeech offers direct rebuttals to hateful speech by challenging perpetrators of hate and showing support to targets of abuse. it provides a promising alternative to more contentious measures, such as content moderation and deplatforming, by contributing a greater amount of positive online speech rather than attempting to mitigate harmful content through removal. advances in the development of large language models mean that the process of producing counterspeech could be made more efficient by automating its generation, which would enable large-scale online campaigns. however, we currently lack a systematic understanding of several important factors relating to the efficacy of counterspeech for hate mitigation, such as which types of counterspeech are most effective, what are the optimal conditions for implementation, and which specific effects of hate it can best ameliorate. this paper aims to fill this gap by systematically reviewing counterspeech research in the social sciences and comparing methodologies and findings with computer science efforts in automatic counterspeech generation. by taking this multi-disciplinary view, we identify promising future directions in both fields.

2023-06-30

Xuandong Zhao, Prabhanjan Ananth, Lei Li, Yu-Xiang Wang
Abstract: we study the problem of watermarking large language models (llms) generated text -- one of the most promising approaches for addressing the safety challenges of llm usage. in this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of llm watermarks. we propose a robust and high-quality watermark method, unigram-watermark, by extending an existing approach with a simplified fixed grouping strategy. we prove that our watermark method enjoys guaranteed generation quality, correctness in watermark detection, and is robust against text editing and paraphrasing. experiments on three varying llms and two datasets verify that our unigram-watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of llms. code is available at https://github.com/xuandongzhao/unigram-watermark.
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang
Abstract: large language models (llms) often contain misleading content, emphasizing the need to align them with human values to ensure secur ai systems. reinforcement learning from human feedback (rlhf) has been employed to achieve this alignment by combining a reward model, typically based on bradley-terry paired comparison, with an rl algorithm such as proximal policy optimization (ppo) to optimize llm responses. however, rlhf exhibits complexity, instability, and sensitivity to hyperparameters. in this paper, we propose preference ranking optimization (pro) as an alternative to ppo for directly aligning llms with the bradley-terry comparison. pro extends the pairwise bradley-terry comparison to accommodate preference rankings of any length. by iteratively contrasting the likelihood of generating responses, pro instructs the llm to prioritize the best response while progressively ranking the remaining responses. in this manner, pro effectively transforms human alignment into aligning the probability ranking of $n$ responses generated by llm with the preference ranking of humans towards these responses. experiments have shown that pro outperforms existing alignment algorithms, achieving comparable results to chatgpt and human responses through automatic-based, reward-based, gpt-4, and human evaluations. furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences can consistently enhance the performance of human alignment.
Harnoor Dhingra, Preetiha Jayashanker, Sayali Moghe, Emma Strubell
Abstract: large language models (llms) are trained primarily on minimally processed web text, which exhibits the same wide range of social biases held by the humans who created that content. consequently, text generated by llms can inadvertently perpetuate stereotypes towards marginalized groups, like the lgbtqia+ community. in this paper, we perform a comparative study of how llms generate text describing people with different sexual identities. analyzing bias in the text generated by an llm using regard score shows measurable bias against queer people. we then show that a post-hoc method based on chain-of-thought prompting using shap analysis can increase the regard of the sentence, representing a promising approach towards debiasing the output of llms in this setting.

2023-06-28

Yufei Huang, Deyi Xiong
Abstract: holistically measuring societal biases of large language models is crucial for detecting and reducing ethical risks in highly capable ai models. in this work, we present a chinese bias benchmark dataset that consists of over 100k questions jointly constructed by human experts and generative language models, covering stereotypes and societal biases in 14 social dimensions related to chinese culture and values. the curation process contains 4 essential steps: bias identification via extensive literature review, ambiguous context generation, ai-assisted disambiguous context generation, snd manual review \& recomposition. the testing instances in the dataset are automatically derived from 3k+ high-quality templates manually authored with stringent quality control. the dataset exhibits wide coverage and high diversity. extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available chinese large language models exhibiting strong bias in certain categories. additionally, we observe from our experiments that fine-tuned models could, to a certain extent, heed instructions and avoid generating outputs that are morally harmful in some types, in the way of "moral self-correction". our dataset and results are publicly available at \href{https://github.com/yfhuangxxxx/cbbq}{https://github.com/yfhuangxxxx/cbbq}, offering debiasing research opportunities to a widened community.
Esin Durmus, Karina Nyugen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam Mccandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, Deep Ganguli
Abstract: large language models (llms) may not equitably represent diverse global perspectives on societal issues. in this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. we first build a dataset, globalopinionqa, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. next, we define a metric that quantifies the similarity between llm-generated survey responses and human responses, conditioned on country. with our framework, we run three experiments on an llm trained to be helpful, honest, and harmless with constitutional ai. by default, llm responses tend to be more similar to the opinions of certain populations, such as those from the usa, and some european and south american countries, highlighting the potential for biases. when we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. when we translate globalopinionqa questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. we release our dataset for others to use and build on. our data is at https://huggingface.co/datasets/anthropic/llm_global_opinions. we also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
Theodore Zhao, Mu Wei, J. Samuel Preston, Hoifung Poon
Abstract: large language models (llms) have demonstrated remarkable capabilities out of box for a wide range of applications, yet accuracy still remains a major growth area, especially in mission-critical domains such as biomedicine. an effective method to calibrate the confidence level on llm responses is essential to automatically detect errors and facilitate human-in-the-loop verification. an important source of calibration signals stems from expert-stipulated programmatic supervision, which is often available at low cost but has its own limitations such as noise and coverage. in this paper, we introduce a pareto optimal self-supervision framework that can leverage available programmatic supervision to systematically calibrate llm responses by producing a risk score for every response, without any additional manual efforts. this is accomplished by learning a harmonizer model to align llm output with other available supervision sources, which would assign higher risk scores to more uncertain llm responses and facilitate error correction. experiments on standard relation extraction tasks in biomedical and general domains demonstrate the promise of this approach, with our proposed risk scores highly correlated with the real error rate of llms. for the most uncertain test instances, dynamic prompting based on our proposed risk scores results in significant accuracy improvement for off-the-shelf llms, boosting gpt-3 results past state-of-the-art (sota) weak supervision and gpt-4 results past sota supervised results on challenging evaluation datasets.
Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein
Abstract: instruction tuning is an effective technique to align large language models (llms) with human intents. in this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior. for example, an adversary can achieve content injection by injecting training examples that mention target content and eliciting such behavior from downstream models. to achieve this goal, we propose \textit{autopoison}, an automated data poisoning pipeline. it naturally and coherently incorporates versatile attack goals into poisoned data with the help of an oracle llm. we showcase two example attacks: content injection and over-refusal attacks, each aiming to induce a specific exploitable behavior. we quantify and benchmark the strength and the stealthiness of our data poisoning scheme. our results show that autopoison allows an adversary to change a model's behavior by poisoning only a small fraction of data while maintaining a high level of stealthiness in the poisoned examples. we hope our work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for responsible deployments of llms. code is available at \url{https://github.com/azshue/autopoison}.

2023-06-27

Sophie Jentzsch, Cigdem Turan
Abstract: pretrained language models are publicly available and constantly finetuned for various real-life applications. as they become capable of grasping complex contextual information, harmful biases are likely increasingly intertwined with those models. this paper analyses gender bias in bert models with two main contributions: first, a novel bias measure is introduced, defining biases as the difference in sentiment valuation of female and male sample versions. second, we comprehensively analyse bert's biases on the example of a realistic imdb movie classifier. by systematically varying elements of the training pipeline, we can conclude regarding their impact on the final model bias. seven different public bert models in nine training conditions, i.e. 63 models in total, are compared. almost all conditions yield significant gender biases. results indicate that reflected biases stem from public bert models rather than task-specific data, emphasising the weight of responsible usage.
Salmonn Talebi, Elizabeth Tong, Mohammad R. K. Mofrad
Abstract: the use of large language models (llms) in healthcare is gaining popularity, but their practicality and safety in clinical settings have not been thoroughly assessed. in high-stakes environments like medical settings, trust and safety are critical issues for llms. to address these concerns, we present an approach to evaluate the performance and trustworthiness of a gpt3.5 model for medical image protocol assignment. we compare it with a fine-tuned bert model and a radiologist. in addition, we have a radiologist review the gpt3.5 output to evaluate its decision-making process. our evaluation dataset consists of 4,700 physician entries across 11 imaging protocol classes spanning the entire head. our findings suggest that the gpt3.5 performance falls behind bert and a radiologist. however, gpt3.5 outperforms bert in its ability to explain its decision, detect relevant word indicators, and model calibration. furthermore, by analyzing the explanations of gpt3.5 for misclassifications, we reveal systematic errors that need to be resolved to enhance its safety and suitability for clinical use.
Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang
Abstract: large language models (llms) have been recently leveraged as training data generators for various natural language processing (nlp) tasks. while previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of llm. thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. additionally, we present a comprehensive empirical study on data generation encompassing vital aspects like bias, diversity, and efficiency, and highlight three key observations: firstly, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; secondly, attribute diversity plays a pivotal role in enhancing model performance; lastly, attributed prompts achieve the performance of simple class-conditional prompts while utilizing only 5\% of the querying cost of chatgpt associated with the latter. the data and code are available on \url{https://github.com/yueyu1030/attrprompt}.

2023-06-26

Ismail Sahbane, Francis Rhys Ward, C Henrik Åslund
Abstract: how to detect and mitigate deceptive ai systems is an open problem for the field of safe and trustworthy ai. we analyse two algorithms for mitigating deception: the first is based on the path-specific objectives framework where paths in the game that incentivise deception are removed. the second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. we construct two simple games and evaluate our algorithms empirically. we find that both methods ensure that our agent is not deceptive, however, shielding tends to achieve higher reward.
Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May
Abstract: we present winoqueer: a benchmark specifically designed to measure whether large language models (llms) encode biases that are harmful to the lgbtq+ community. the benchmark is community-sourced, via application of a novel method that generates a bias benchmark from a community survey. we apply our benchmark to several popular llms and find that off-the-shelf models generally do exhibit considerable anti-queer bias. finally, we show that llm bias against a marginalized community can be somewhat mitigated by finetuning on data written about or by members of that community, and that social media text written by community members is more effective than news text written about the community by non-members. our method for community-in-the-loop benchmark development provides a blueprint for future researchers to develop community-driven, harms-grounded llm benchmarks for other marginalized communities.
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt
Abstract: large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." these models should respond helpfully to user questions, but refuse to answer requests that could cause harm. however, adversarial users can construct inputs which circumvent attempts at alignment. in this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). these inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. we show that existing nlp-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current nlp-based attacks fail, we can find adversarial inputs with brute force. as a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. however the recent trend in large-scale ml models is multimodal models that allow users to provide images that influence the text that is generated. we show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. we conjecture that improved nlp attacks may demonstrate this same level of adversarial control over text-only models.

2023-06-25

Mohamed Amine Ferrag, Mthandazo Ndhlovu, Norbert Tihanyi, Lucas C. Cordeiro, Merouane Debbah, Thierry Lestable
Abstract: natural language processing (nlp) domain is experiencing a revolution due to the capabilities of pre-trained large language models ( llms), fueled by ground-breaking transformers architecture, resulting into unprecedented advancements. their exceptional aptitude for assessing probability distributions of text sequences is the primary catalyst for outstanding improvement of both the precision and efficiency of nlp models. this paper introduces for the first time securityllm, a pre-trained language model designed for cybersecurity threats detection. the securityllm model is articulated around two key generative elements: securitybert and falconllm. securitybert operates as a cyber threat detection mechanism, while falconllm is an incident response and recovery system. to the best of our knowledge, securitybert represents the inaugural application of bert in cyber threat detection. despite the unique nature of the input data and features, such as the reduced significance of syntactic structures in content classification, the suitability of bert for this duty demonstrates unexpected potential, thanks to our pioneering study. we reveal that a simple classification model, created from scratch, and consolidated with llms, exceeds the performance of established traditional machine learning (ml) and deep learning (dl) methods in cyber threat detection, like convolutional neural networks (cnn) or recurrent neural networks (rnn). the experimental analysis, conducted using a collected cybersecurity dataset, proves that our securityllm model can identify fourteen (14) different types of attacks with an overall accuracy of 98%

2023-06-24

Reza Fayyazi, Shanchieh Jay Yang
Abstract: the volume, variety, and velocity of change in vulnerabilities and exploits have made incident threat analysis challenging with human expertise and experience along. tactics, techniques, and procedures (ttps) are to describe how and why attackers exploit vulnerabilities. however, a ttp description written by one security professional can be interpreted very differently by another, leading to confusion in cybersecurity operations or even business, policy, and legal decisions. meanwhile, advancements in ai have led to the increasing use of natural language processing (nlp) algorithms to assist the various tasks in cyber operations. with the rise of large language models (llms), nlp tasks have significantly improved because of the llm's semantic understanding and scalability. this leads us to question how well llms can interpret ttps or general cyberattack descriptions to inform analysts of the intended purposes of cyberattacks. we propose to analyze and compare the direct use of llms (e.g., gpt-3.5) versus supervised fine-tuning (sft) of small-scale-llms (e.g., bert) to study their capabilities in predicting att&ck tactics. our results reveal that the small-scale-llms with sft provide a more focused and clearer differentiation between the att&ck tactics (if such differentiation exists). on the other hand, direct use of llms offer a broader interpretation of cyberattack techniques. when treating more general cases, despite the power of llms, inherent ambiguity exists and limits their predictive power. we then summarize the challenges and recommend research directions on llms to treat the inherent ambiguity of ttp descriptions used in various cyber operations.

2023-06-23

Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
Abstract: with the rise of large language models (llms) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative. for example, a company deploying a client-facing chatbot must ensure that the model will not respond to client requests with profanity. current evaluations approach this problem using small, domain-specific datasets with human-curated labels. these evaluation sets are often sampled from a narrow and simplified distribution, and data sources can unknowingly be leaked into the training set which can lead to misleading evaluations. to bypass these drawbacks, we propose a framework for self-supervised evaluation of llms by analyzing their sensitivity or invariance to transformations on the input text. self-supervised evaluation can directly monitor llm behavior on datasets collected in the wild or streamed during live model deployment. we demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence, in addition to sensitivity to grammatical structure and tokenization errors. when comparisons to similar human-labeled benchmarks are available, we find strong correlations between self-supervised and human-supervised evaluations. the self-supervised paradigm complements current evaluation strategies that rely on labeled data.

2023-06-22

Jonathan H. Rystrøm
Abstract: as generative language models are deployed in ever-wider contexts, concerns about their political values have come to the forefront with critique from all parts of the political spectrum that the models are biased and lack neutrality. however, the question of what neutrality is and whether it is desirable remains underexplored. in this paper, i examine neutrality through an audit of delphi [arxiv:2110.07574], a large language model designed for crowdsourced ethics. i analyse how delphi responds to politically controversial questions compared to different us political subgroups. i find that delphi is poorly calibrated with respect to confidence and exhibits a significant political skew. based on these results, i examine the question of neutrality from a data-feminist lens, in terms of how notions of neutrality shift power and further marginalise unheard voices. these findings can hopefully contribute to a more reflexive debate about the normative questions of alignment and what role we want generative models to play in society.
Subash Neupane, Ivan A. Fernandez, Sudip Mittal, Shahram Rahimi
Abstract: generative artificial intelligence (genai) has emerged as a powerful technology capable of autonomously producing highly realistic content in various domains, such as text, images, audio, and videos. with its potential for positive applications in creative arts, content generation, virtual assistants, and data synthesis, genai has garnered significant attention and adoption. however, the increasing adoption of genai raises concerns about its potential misuse for crafting convincing phishing emails, generating disinformation through deepfake videos, and spreading misinformation via authentic-looking social media posts, posing a new set of challenges and risks in the realm of cybersecurity. to combat the threats posed by genai, we propose leveraging the cyber kill chain (ckc) to understand the lifecycle of cyberattacks, as a foundational model for cyber defense. this paper aims to provide a comprehensive analysis of the risk areas introduced by the offensive use of genai techniques in each phase of the ckc framework. we also analyze the strategies employed by threat actors and examine their utilization throughout different phases of the ckc, highlighting the implications for cyber defense. additionally, we propose genai-enabled defense strategies that are both attack-aware and adaptive. these strategies encompass various techniques such as detection, deception, and adversarial training, among others, aiming to effectively mitigate the risks posed by genai-induced cyber threats.
Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi
Abstract: the task of empowering large language models (llms) to accurately express their confidence, referred to as confidence elicitation, is essential in ensuring reliable and trustworthy decision-making processes. previous methods, which primarily rely on model logits, have become less suitable for llms and even infeasible with the rise of closed-source llms (e.g., commercialized llm apis). this leads to a growing need to explore the untapped area of \emph{non-logit-based} approaches to estimate the uncertainty of llms. hence, in this study, we investigate approaches for confidence elicitation that do not require model fine-tuning or access to proprietary information. we introduce three categories of methods: verbalize-based, consistency-based, and their hybrid methods for benchmarking, and evaluate their performance across five types of datasets and four widely-used llms. our analysis of these methods uncovers several key insights: 1) llms often exhibit a high degree of overconfidence when verbalizing their confidence; 2) prompting strategies such as cot, top-k and multi-step confidences improve calibration of verbalized confidence; 3) consistency-based methods outperform the verbalized confidences in most cases, with particularly notable improvements on the arithmetic reasoning task; 4) hybrid methods consistently deliver the best performance over their baselines, thereby emerging as a promising state-of-the-art approach; 5) despite these advancements, all investigated methods continue to struggle with challenging tasks, such as those requiring professional knowledge, leaving significant scope for improvement of confidence elicitation.
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, Prateek Mittal
Abstract: recently, there has been a surge of interest in integrating vision into large language models (llms), exemplified by visual language models (vlms) such as flamingo and gpt-4. this paper sheds light on the security and safety implications of this trend. first, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated llms. second, we highlight that the versatility of llms also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. as an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned llms with integrated vision. intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned llm, compelling it to heed a wide range of harmful instructions that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. our study underscores the escalating adversarial risks associated with the pursuit of multimodality. our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of ai alignment. the presented attack suggests a fundamental adversarial challenge for ai alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

2023-06-21

Isaac David, Liyi Zhou, Kaihua Qin, Dawn Song, Lorenzo Cavallaro, Arthur Gervais
Abstract: we investigate the feasibility of employing large language models (llms) for conducting the security audit of smart contracts, a traditionally time-consuming and costly process. our research focuses on the optimization of prompt engineering for enhanced security analysis, and we evaluate the performance and accuracy of llms using a benchmark dataset comprising 52 decentralized finance (defi) smart contracts that have previously been compromised. our findings reveal that, when applied to vulnerable contracts, both gpt-4 and claude models correctly identify the vulnerability type in 40% of the cases. however, these models also demonstrate a high false positive rate, necessitating continued involvement from manual auditors. the llms tested outperform a random model by 20% in terms of f1-score. to ensure the integrity of our study, we conduct mutation testing on five newly developed and ostensibly secure smart contracts, into which we manually insert two and 15 vulnerabilities each. this testing yielded a remarkable best-case 78.7% true positive rate for the gpt-4-32k model. we tested both, asking the models to perform a binary classification on whether a contract is vulnerable, and a non-binary prompt. we also examined the influence of model temperature variations and context length on the llm's performance. despite the potential for many further enhancements, this work lays the groundwork for a more efficient and economical approach to smart contract security audits.
Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima, Mitsuhiro Okada
Abstract: this paper investigates whether current large language models exhibit biases in logical reasoning, similar to humans. specifically, we focus on syllogistic reasoning, a well-studied form of inference in the cognitive science of human deduction. to facilitate our analysis, we introduce a dataset called neubaroco, originally designed for psychological experiments that assess human logical abilities in syllogistic reasoning. the dataset consists of syllogistic inferences in both english and japanese. we examine three types of biases observed in human syllogistic reasoning: belief biases, conversion errors, and atmosphere effects. our findings demonstrate that current large language models struggle more with problems involving these three types of biases.
Kanishk Gandhi, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman
Abstract: as large language models (llms) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. however, despite the recent attempts to assess the theory-of-mind (tom) reasoning capabilities of llms, the degree to which these models can align with human tom remains a nuanced topic of exploration. this is primarily due to two distinct challenges: (1) the presence of inconsistent results from previous evaluations, and (2) concerns surrounding the validity of existing evaluation methodologies. to address these challenges, we present a novel framework for procedurally generating evaluations with llms by populating causal templates. using our framework, we create a new social reasoning benchmark (bigtom) for llms which consists of 25 controls and 5,000 model-written evaluations. we find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. using bigtom, we evaluate the social reasoning capabilities of a variety of llms and compare model performances with human performance. our results suggest that gpt4 has tom capabilities that mirror human inference patterns, though less reliable, while other llms struggle.

2023-06-20

Yue Huang, Qihui Zhang, Philip S. Y, Lichao Sun
Abstract: large language models (llms) such as chatgpt, have gained significant attention due to their impressive natural language processing capabilities. it is crucial to prioritize human-centered principles when utilizing these models. safeguarding the ethical and moral compliance of llms is of utmost importance. however, individual ethical issues have not been well studied on the latest llms. therefore, this study aims to address these gaps by introducing a new benchmark -- trustgpt. trustgpt provides a comprehensive evaluation of llms in three crucial areas: toxicity, bias, and value-alignment. initially, trustgpt examines toxicity in language models by employing toxic prompt templates derived from social norms. it then quantifies the extent of bias in models by measuring quantifiable toxicity values across different groups. lastly, trustgpt assesses the value of conversation generation models from both active value-alignment and passive value-alignment tasks. through the implementation of trustgpt, this research aims to enhance our understanding of the performance of conversation generation models and promote the development of language models that are more ethical and socially responsible.
Shawn Curran, Sam Lansley, Oliver Bethell
Abstract: the legal profession necessitates a multidimensional approach that involves synthesizing an in-depth comprehension of a legal issue with insightful commentary based on personal experience, combined with a comprehensive understanding of pertinent legislation, regulation, and case law, in order to deliver an informed legal solution. the present offering with generative ai presents major obstacles in replicating this, as current models struggle to integrate and navigate such a complex interplay of understanding, experience, and fact-checking procedures. it is noteworthy that where generative ai outputs understanding and experience, which reflect the aggregate of various subjective views on similar topics, this often deflects the model's attention from the crucial legal facts, thereby resulting in hallucination. hence, this paper delves into the feasibility of three independent llms, each focused on understanding, experience, and facts, synthesising as one single ensemble model to effectively counteract the current challenges posed by the existing monolithic generative ai models. we introduce an idea of mutli-length tokenisation to protect key information assets like common law judgements, and finally we interrogate the most advanced publicly available models for legal hallucination, with some interesting results.
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li
Abstract: generative pre-trained transformer (gpt) models have exhibited exciting progress in capabilities, capturing the interest of practitioners and the public alike. yet, while the literature on the trustworthiness of gpt models remains limited, practitioners have proposed employing capable gpt models for sensitive applications to healthcare and finance - where mistakes can be costly. to this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on gpt-4 and gpt-3.5, considering diverse perspectives - including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. for instance, we find that gpt models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. we also find that although gpt-4 is usually more trustworthy than gpt-3.5 on standard benchmarks, gpt-4 is more vulnerable given jailbreaking system or user prompts, potentially due to the reason that gpt-4 follows the (misleading) instructions more precisely. our work illustrates a comprehensive trustworthiness evaluation of gpt models and sheds light on the trustworthiness gaps. our benchmark is publicly available at https://decodingtrust.github.io/.
Christopher T. Small, Ivan Vendrov, Esin Durmus, Hadjar Homaei, Elizabeth Barry, Julien Cornebise, Ted Suzman, Deep Ganguli, Colin Megill
Abstract: polis is a platform that leverages machine intelligence to scale up deliberative processes. in this paper, we explore the opportunities and risks associated with applying large language models (llms) towards challenges with facilitating, moderating and summarizing the results of polis engagements. in particular, we demonstrate with pilot experiments using anthropic's claude that llms can indeed augment human intelligence to help more efficiently run polis conversations. in particular, we find that summarization capabilities enable categorically new methods with immense promise to empower the public in collective meaning-making exercises. and notably, llm context limitations have a significant impact on insight and quality of these results. however, these opportunities come with risks. we discuss some of these risks, as well as principles and techniques for characterizing and mitigating them, and the implications for other deliberative or political systems that may employ llms. finally, we conclude with several open future research directions for augmenting tools like polis with llms.
Dan Hendrycks, Mantas Mazeika, Thomas Woodside
Abstract: rapid advancements in artificial intelligence (ai) have sparked growing concerns among experts, policymakers, and world leaders regarding the potential for increasingly advanced ai systems to pose catastrophic risks. although numerous risks have been detailed separately, there is a pressing need for a systematic discussion and illustration of the potential dangers to better inform efforts to mitigate them. this paper provides an overview of the main sources of catastrophic ai risks, which we organize into four categories: malicious use, in which individuals or groups intentionally use ais to cause harm; ai race, in which competitive environments compel actors to deploy unsafe ais or cede control to ais; organizational risks, highlighting how human factors and complex systems can increase the chances of catastrophic accidents; and rogue ais, describing the inherent difficulty in controlling agents far more intelligent than humans. for each category of risk, we describe specific hazards, present illustrative stories, envision ideal scenarios, and propose practical suggestions for mitigating these dangers. our goal is to foster a comprehensive understanding of these risks and inspire collective and proactive efforts to ensure that ais are developed and deployed in a safe manner. ultimately, we hope this will allow us to realize the benefits of this powerful technology while minimizing the potential for catastrophic outcomes.

2023-06-19

Matija Franklin, Rebecca Gorman, Hal Ashton, Stuart Armstrong
Abstract: this article is a primer on concept extrapolation - the ability to take a concept, a feature, or a goal that is defined in one context and extrapolate it safely to a more general context. concept extrapolation aims to solve model splintering - a ubiquitous occurrence wherein the features or concepts shift as the world changes over time. through discussing value splintering and value extrapolation the article argues that concept extrapolation is necessary for artificial intelligence alignment.
Louis Rosenberg
Abstract: the technology of conversational ai has made significant advancements over the last eighteen months. as a consequence, conversational agents are likely to be deployed in the near future that are designed to pursue targeted influence objectives. sometimes referred to as the "ai manipulation problem," the emerging risk is that consumers will unwittingly engage in real-time dialog with predatory ai agents that can skillfully persuade them to buy particular products, believe particular pieces of misinformation, or fool them into revealing sensitive personal data. for many users, current systems like chatgpt and lamda feel safe because they are primarily text-based, but the industry is already shifting towards real-time voice and photorealistic digital personas that look, move, and express like real people. this will enable the deployment of agenda-driven virtual spokespeople (vsps) that will be highly persuasive through real-time adaptive influence. this paper explores the manipulative tactics that are likely to be deployed through conversational ai agents, the unique threats such agents pose to the epistemic agency of human users, and the emerging need for policymakers to protect against the most likely predatory practices.

2023-06-18

Praneeth Nemani, Yericherla Deepak Joel, Palla Vijay, Farhana Ferdousi Liza
Abstract: gender bias in artificial intelligence (ai) has emerged as a pressing concern with profound implications for individuals' lives. this paper presents a comprehensive survey that explores gender bias in transformer models from a linguistic perspective. while the existence of gender bias in language models has been acknowledged in previous studies, there remains a lack of consensus on how to effectively measure and evaluate this bias. our survey critically examines the existing literature on gender bias in transformers, shedding light on the diverse methodologies and metrics employed to assess bias. several limitations in current approaches to measuring gender bias in transformers are identified, encompassing the utilization of incomplete or flawed metrics, inadequate dataset sizes, and a dearth of standardization in evaluation methods. furthermore, our survey delves into the potential ramifications of gender bias in transformers for downstream applications, including dialogue systems and machine translation. we underscore the importance of fostering equity and fairness in these systems by emphasizing the need for heightened awareness and accountability in developing and deploying language technologies. this paper serves as a comprehensive overview of gender bias in transformer models, providing novel insights and offering valuable directions for future research in this critical domain.
Xiao Zhan, Yifan Xu, Stefan Sarkadi
Abstract: chatgpt, an ai chatbot, has gained popularity for its capability in generating human-like responses. however, this feature carries several risks, most notably due to its deceptive behaviour such as offering users misleading or fabricated information that could further cause ethical issues. to better understand the impact of chatgpt on our social, cultural, economic, and political interactions, it is crucial to investigate how chatgpt operates in the real world where various societal pressures influence its development and deployment. this paper emphasizes the need to study chatgpt "in the wild", as part of the ecosystem it is embedded in, with a strong focus on user involvement. we examine the ethical challenges stemming from chatgpt's deceptive human-like interactions and propose a roadmap for developing more transparent and trustworthy chatbots. central to our approach is the importance of proactive risk assessment and user participation in shaping the future of chatbot technology.

2023-06-16

Stefan F. Schouten, Baran Barbarestani, Wondimagegnhue Tufa, Piek Vossen, Ilia Markov
Abstract: given the dynamic nature of toxic language use, automated methods for detecting toxic spans are likely to encounter distributional shift. to explore this phenomenon, we evaluate three approaches for detecting toxic spans under cross-domain conditions: lexicon-based, rationale extraction, and fine-tuned language models. our findings indicate that a simple method using off-the-shelf lexicons performs best in the cross-domain setup. the cross-domain error analysis suggests that (1) rationale extraction methods are prone to false negatives, while (2) language models, despite performing best for the in-domain case, recall fewer explicitly toxic words than lexicons and are prone to certain types of false positives. our code is publicly available at: https://github.com/sfschouten/toxic-cross-domain.
Victor Steinborn, Antonis Maronikolakis, Hinrich Schütze
Abstract: in efforts to keep up with the rapid progress and use of large language models, gender bias research is becoming more prevalent in nlp. non-english bias research, however, is still in its infancy with most work focusing on english. in our work, we study how grammatical gender bias relating to politeness levels manifests in japanese and korean language models. linguistic studies in these languages have identified a connection between gender bias and politeness levels, however it is not yet known if language models reproduce these biases. we analyze relative prediction probabilities of the male and female grammatical genders using templates and find that informal polite speech is most indicative of the female grammatical gender, while rude and formal speech is most indicative of the male grammatical gender. further, we find politeness levels to be an attack vector for allocational gender bias in cyberbullying detection models. cyberbullies can evade detection through simple techniques abusing politeness levels. we introduce an attack dataset to (i) identify representational gender bias across politeness levels, (ii) demonstrate how gender biases can be abused to bypass cyberbullying detection models and (iii) show that allocational biases can be mitigated via training on our proposed dataset. through our findings we highlight the importance of bias research moving beyond its current english-centrism.

2023-06-15

Myles Foley, Ambrish Rawat, Taesung Lee, Yufang Hou, Gabriele Picco, Giulio Zizzo
Abstract: the wide applicability and adaptability of generative large language models (llms) has enabled their rapid adoption. while the pre-trained models can perform many tasks, such models are often fine-tuned to improve their performance on various downstream applications. however, this leads to issues over violation of model licenses, model theft, and copyright infringement. moreover, recent advances show that generative technology is capable of producing harmful content which exacerbates the problems of accountability within model supply chains. thus, we need a method to investigate how a model was trained or a piece of text was generated and what their pre-trained base model was. in this paper we take the first step to address this open problem by tracing back the origin of a given fine-tuned llm to its corresponding pre-trained base model. we consider different knowledge levels and attribution strategies, and find that we can correctly trace back 8 out of the 10 fine tuned models with our best method.
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell
Abstract: deploying large language models (lms) can pose hazards from harmful outputs such as toxic or false text. prior work has introduced automated tools that elicit harmful outputs to identify these risks. while this is a valuable step toward securing models, these approaches rely on a pre-existing way to efficiently classify undesirable outputs. using a pre-existing classifier does not allow for red-teaming to be tailored to the target model. furthermore, when failures can be easily classified in advance, red-teaming has limited marginal value because problems can be avoided by simply filtering training data and/or model outputs. here, we consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures. our framework consists of three steps: 1) exploring the model's range of behaviors in the desired context; 2) establishing a definition and measurement for undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) exploiting the model's flaws using this measure to develop diverse adversarial prompts. we use this approach to red-team gpt-3 to discover classes of inputs that elicit false statements. in doing so, we construct the commonclaim dataset of 20,000 statements labeled by humans as common-knowledge-true, common knowledge-false, or neither. we are making code and data available.

2023-06-13

Fabien Roger
Abstract: when using adversarial training, it is common practice to train against the most egregious failures. however, this might imply using examples with sensitive information (such as leaked passwords or security vulnerabilities) as training data. one might assume that language models trained with gradient descent never generate text snippets which were only present in examples associated with the lowest possible reward. in this paper, we show that this assumption is wrong: in some situations, large language models do learn from such negatively-reinforced examples. we present a specific training setup that enables pythia-160m to guess passwords 13% more often than it would by guessing randomly, despite only showing it these passwords on examples where the model is incentivized to not output these passwords. our code is available at www.github.com/fabienroger/learning-from-negative-examples
Arno Candel, Jon Mckinney, Philipp Singer, Pascal Pfeiffer, Maximilian Jeblick, Prithvi Prabhu, Jeff Gambera, Mark Landry, Shivam Bansal, Ryan Chesler, Chun Ming Lee, Marcos V. Conde, Pasha Stetsenko, Olivier Grellier, Srisatish Ambati
Abstract: applications built on top of large language models (llms) such as gpt-4 represent a revolution in ai due to their human-level capabilities in natural language processing. however, they also pose many significant risks such as the presence of biased, private, or harmful text, and the unauthorized inclusion of copyrighted material. we introduce h2ogpt, a suite of open-source code repositories for the creation and use of llms based on generative pretrained transformers (gpts). the goal of this project is to create the world's best truly open-source alternative to closed-source approaches. in collaboration with and as part of the incredible and unstoppable open-source community, we open-source several fine-tuned h2ogpt models from 7 to 40 billion parameters, ready for commercial use under fully permissive apache 2.0 licenses. included in our release is 100\% private document search using natural language. open-source language models help boost ai development and make it more accessible and trustworthy. they lower entry hurdles, allowing people and groups to tailor these models to their needs. this openness increases innovation, transparency, and fairness. an open-source strategy is needed to share ai benefits fairly, and h2o.ai will continue to democratize ai and llms.
Mars Gokturk Buchholz
Abstract: the detection of political fake statements is crucial for maintaining information integrity and preventing the spread of misinformation in society. historically, state-of-the-art machine learning models employed various methods for detecting deceptive statements. these methods include the use of metadata (w. wang et al., 2018), n-grams analysis (singh et al., 2021), and linguistic (wu et al., 2022) and stylometric (islam et al., 2020) features. recent advancements in large language models, such as gpt-3 (brown et al., 2020) have achieved state-of-the-art performance on a wide range of tasks. in this study, we conducted experiments with gpt-3 on the liar dataset (w. wang et al., 2018) and achieved higher accuracy than state-of-the-art models without using any additional meta or linguistic features. additionally, we experimented with zero-shot learning using a carefully designed prompt and achieved near state-of-the-art performance. an advantage of this approach is that the model provided evidence for its decision, which adds transparency to the model's decision-making and offers a chance for users to verify the validity of the evidence provided.
Zhigang Kan, Linbo Qiao, Hao Yu, Liwen Peng, Yifu Gao, Dongsheng Li
Abstract: large language models (llms) are gaining increasing attention due to their exceptional performance across numerous tasks. as a result, the general public utilize them as an influential tool for boosting their productivity while natural language processing researchers endeavor to employ them in solving existing or new research problems. unfortunately, individuals can only access such powerful ais through apis, which ultimately leads to the transmission of raw data to the models' providers and increases the possibility of privacy data leakage. current privacy-preserving methods for cloud-deployed language models aim to protect privacy information in the pre-training dataset or during the model training phase. however, they do not meet the specific challenges presented by the remote access approach of new large-scale language models. this paper introduces a novel task, "user privacy protection for dialogue models," which aims to safeguard sensitive user information from any possible disclosure while conversing with chatbots. we also present an evaluation scheme for this task, which covers evaluation metrics for privacy protection, data availability, and resistance to simulation attacks. moreover, we propose the first framework for this task, namely privacy protection through text sanitization. before sending the input to remote large models, it filters out the sensitive information, using several rounds of text sanitization based on privacy types that users define. upon receiving responses from the larger model, our framework automatically restores privacy to ensure that the conversation goes smoothly, without intervention from the privacy filter. experiments based on real-world datasets demonstrate the efficacy of our privacy-preserving approach against eavesdropping from potential attackers.

2023-06-12

Andrew Critch, Stuart Russell
Abstract: while several recent works have identified societal-scale and extinction-level risks to humanity arising from artificial intelligence, few have attempted an {\em exhaustive taxonomy} of such risks. many exhaustive taxonomies are possible, and some are useful -- particularly if they reveal new risks or practical approaches to safety. this paper explores a taxonomy based on accountability: whose actions lead to the risk, are the actors unified, and are they deliberate? we also provide stories to illustrate how the various risk types could each play out, including risks arising from unanticipated interactions of many ai systems, as well as risks from deliberate misuse, for which combined technical and policy solutions are indicated.
Minhyeok Lee
Abstract: generative language models (glms) have the potential to significantly shape our linguistic landscape due to their expansive use in various digital applications. however, this widespread adoption might inadvertently trigger a self-reinforcement learning cycle that can amplify existing linguistic biases. this paper explores the possibility of such a phenomenon, where the initial biases in glms, reflected in their generated text, can feed into the learning material of subsequent models, thereby reinforcing and amplifying these biases. moreover, the paper highlights how the pervasive nature of glms might influence the linguistic and cognitive development of future generations, as they may unconsciously learn and reproduce these biases. the implications of this potential self-reinforcement cycle extend beyond the models themselves, impacting human language and discourse. the advantages and disadvantages of this bias amplification are weighed, considering educational benefits and ease of future glm learning against threats to linguistic diversity and dependence on initial glms. this paper underscores the need for rigorous research to understand and address these issues. it advocates for improved model transparency, bias-aware training techniques, development of methods to distinguish between human and glm-generated text, and robust measures for fairness and bias evaluation in glms. the aim is to ensure the effective, safe, and equitable use of these powerful technologies, while preserving the richness and diversity of human language.
Yanchen Wang, Lisa Singh
Abstract: generative ai models continue to become more powerful. the launch of chatgpt in november 2022 has ushered in a new era of ai. chatgpt and other similar chatbots have a range of capabilities, from answering student homework questions to creating music and art. there are already concerns that humans may be replaced by chatbots for a variety of jobs. because of the wide spectrum of data chatbots are built on, we know that they will have human errors and human biases built into them. these biases may cause significant harm and/or inequity toward different subpopulations. to understand the strengths and weakness of chatbot responses, we present a position paper that explores different use cases of chatgpt to determine the types of questions that are answered fairly and the types that still need improvement. we find that chatgpt is a fair search engine for the tasks we tested; however, it has biases on both text generation and code generation. we find that chatgpt is very sensitive to changes in the prompt, where small changes lead to different levels of fairness. this suggests that we need to immediately implement "corrections" or mitigation strategies in order to improve fairness of these systems. we suggest different strategies to improve chatbots and also advocate for an impartial review panel that has access to the model parameters to measure the levels of different types of biases and then recommends safeguards that move toward responses that are less discriminatory and more accurate.
Ethan Mollick, Lilach Mollick
Abstract: this paper examines the transformative role of large language models (llms) in education and their potential as learning tools, despite their inherent risks and limitations. the authors propose seven approaches for utilizing ai in classrooms: ai-tutor, ai-coach, ai-mentor, ai-teammate, ai-tool, ai-simulator, and ai-student, each with distinct pedagogical benefits and risks. the aim is to help students learn with and about ai, with practical strategies designed to mitigate risks such as complacency about the ai's output, errors, and biases. these strategies promote active oversight, critical assessment of ai outputs, and complementarity of ai's capabilities with the students' unique insights. by challenging students to remain the "human in the loop," the authors aim to enhance learning outcomes while ensuring that ai serves as a supportive tool rather than a replacement. the proposed framework offers a guide for educators navigating the integration of ai-assisted learning in classrooms

2023-06-11

Jiaqi Xue, Mengxin Zheng, Ting Hua, Yilin Shen, Yepeng Liu, Ladislau Boloni, Qian Lou
Abstract: large language models (llms) are progressively being utilized as machine learning services and interface tools for various applications. however, the security implications of llms, particularly in relation to adversarial and trojan attacks, remain insufficiently examined. in this paper, we propose trojllm, an automatic and black-box framework to effectively generate universal and stealthy triggers. when these triggers are incorporated into the input data, the llms' outputs can be maliciously manipulated. moreover, the framework also supports embedding trojans within discrete prompts, enhancing the overall effectiveness and precision of the triggers' attacks. specifically, we propose a trigger discovery algorithm for generating universal triggers for various inputs by querying victim llm-based apis using few-shot data samples. furthermore, we introduce a novel progressive trojan poisoning algorithm designed to generate poisoned prompts that retain efficacy and transferability across a diverse range of models. our experiments and results demonstrate trojllm's capacity to effectively insert trojans into text prompts in real-world black-box llm apis including gpt-3.5 and gpt-4, while maintaining exceptional performance on clean test sets. our work sheds light on the potential security risks in current models and offers a potential defensive approach. the source code of trojllm is available at https://github.com/ucf-ml-research/trojllm.

2023-06-09

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
Abstract: evaluating large language model (llm) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. to address this, we explore using strong llms as judges to evaluate these models on more open-ended questions. we examine the usage and limitations of llm-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. we then verify the agreement between llm judges and human preferences by introducing two benchmarks: mt-bench, a multi-turn question set; and chatbot arena, a crowdsourced battle platform. our results reveal that strong llm judges like gpt-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. hence, llm-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of llama and vicuna. the mt-bench questions, 3k expert votes, and 30k conversations with human preferences are publicly available at https://github.com/lm-sys/fastchat/tree/main/fastchat/llm_judge.
Takashi Koide, Naoki Fukushi, Hiroki Nakano, Daiki Chiba
Abstract: the rise of large language models (llms) has had a significant impact on various domains, including natural language processing and artificial intelligence. while llms such as chatgpt have been extensively researched for tasks such as code generation and text synthesis, their application in detecting malicious web content, particularly phishing sites, has been largely unexplored. to combat the rising tide of automated cyber attacks facilitated by llms, it is imperative to automate the detection of malicious web content, which requires approaches that leverage the power of llms to analyze and classify phishing sites. in this paper, we propose a novel method that utilizes chatgpt to detect phishing sites. our approach involves leveraging a web crawler to gather information from websites and generate prompts based on this collected data. this approach enables us to detect various phishing sites without the need for fine-tuning machine learning models and identify social engineering techniques from the context of entire websites and urls. to evaluate the performance of our proposed method, we conducted experiments using a dataset. the experimental results using gpt-4 demonstrated promising performance, with a precision of 98.3% and a recall of 98.4%. comparative analysis between gpt-3.5 and gpt-4 revealed an enhancement in the latter's capability to reduce false negatives. these findings not only highlight the potential of llms in efficiently identifying phishing sites but also have significant implications for enhancing cybersecurity measures and protecting users from the dangers of online fraudulent activities.
Wissam Antoun, Virginie Mouilleron, Benoît Sagot, Djamé Seddah
Abstract: recent advances in natural language processing (nlp) have led to the development of large language models (llms) such as chatgpt. this paper proposes a methodology for developing and evaluating chatgpt detectors for french text, with a focus on investigating their robustness on out-of-domain data and against common attack schemes. the proposed method involves translating an english dataset into french and training a classifier on the translated data. results show that the detectors can effectively detect chatgpt-generated text, with a degree of robustness against basic attack techniques in in-domain settings. however, vulnerabilities are evident in out-of-domain contexts, highlighting the challenge of detecting adversarial text. the study emphasizes caution when applying in-domain testing results to a wider variety of content. we provide our translated datasets and models as open-source resources. https://gitlab.inria.fr/wantoun/robust-chatgpt-detection
Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé, Jesse Dodge, Ellie Evans, Sara Hooker, Yacine Jernite, Alexandra Sasha Luccioni, Alberto Lusoli, Margaret Mitchell, Jessica Newman, Marie-Therese Png, Andrew Strait, Apostol Vassilev
Abstract: generative ai systems across modalities, ranging from text, image, audio, and video, have broad social impacts, but there exists no official standard for means of evaluating those impacts and which impacts should be evaluated. we move toward a standard approach in evaluating a generative ai system for any modality, in two overarching categories: what is able to be evaluated in a base system that has no predetermined application and what is able to be evaluated in society. we describe specific social impact categories and how to approach and conduct evaluations in the base technical system, then in people and society. our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. suggested methods for evaluation apply to all modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. we offer five overarching categories for what is able to be evaluated in society, each with their own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. each subcategory includes recommendations for mitigating harm. we are concurrently crafting an evaluation repository for the ai research community to contribute existing evaluations along the given categories. this version will be updated following a craft session at acm facct 2023.
Ryan Mccoppin, Marla Kennedy, Platon Lukyanenko, Sean Kennedy
Abstract: including human analysis has the potential to positively affect the robustness of deep neural networks and is relatively unexplored in the adversarial machine learning literature. neural network visual explanation maps have been shown to be prone to adversarial attacks. further research is needed in order to select robust visualizations of explanations for the image analyst to evaluate a given model. these factors greatly impact human-in-the-loop (hitl) evaluation tools due to their reliance on adversarial images, including explanation maps and measurements of robustness. we believe models of human visual attention may improve interpretability and robustness of human-machine imagery analysis systems. our challenge remains, how can hitl evaluation be robust in this adversarial landscape?
Philip Feldman, James R. Foulds, Shimei Pan
Abstract: recent advances in large language models (llms), such as chatgpt, have led to highly sophisticated conversation agents. however, these models suffer from "hallucinations," where the model generates false or fabricated information. addressing this challenge is crucial, particularly with ai-driven platforms being adopted across various sectors. in this paper, we propose a novel method to recognize and flag instances when llms perform outside their domain knowledge, and ensuring users receive accurate information. we find that the use of context combined with embedded tags can successfully combat hallucinations within generative language models. to do this, we baseline hallucination frequency in no-context prompt-response pairs using generated urls as easily-tested indicators of fabricated data. we observed a significant reduction in overall hallucination when context was supplied along with question prompts for tested generative engines. lastly, we evaluated how placing tags within contexts impacted model responses and were able to eliminate hallucinations in responses with 98.88% effectiveness.
Aisha Khatun, Daniel G. Brown
Abstract: large language models (llms) have become mainstream technology with their versatile use cases and impressive performance. despite the countless out-of-the-box applications, llms are still not reliable. a lot of work is being done to improve the factual accuracy, consistency, and ethical standards of these models through fine-tuning, prompting, and reinforcement learning with human feedback (rlhf), but no systematic analysis of the responses of these models to different categories of statements, or on their potential vulnerabilities to simple prompting changes is available. in this work, we analyze what confuses gpt-3: how the model responds to certain sensitive topics and what effects the prompt wording has on the model response. we find that gpt-3 correctly disagrees with obvious conspiracies and stereotypes but makes mistakes with common misconceptions and controversies. the model responses are inconsistent across prompts and settings, highlighting gpt-3's unreliability. dataset and code of our analysis is available in https://github.com/tanny411/gpt3-reliability-check.
João Phillipe Cardenuto, Jing Yang, Rafael Padilha, Renjie Wan, Daniel Moreira, Haoliang Li, Shiqi Wang, Fernanda Andaló, Sébastien Marcel, Anderson Rocha
Abstract: synthetic realities are digital creations or augmentations that are contextually generated through the use of artificial intelligence (ai) methods, leveraging extensive amounts of data to construct new narratives or realities, regardless of the intent to deceive. in this paper, we delve into the concept of synthetic realities and their implications for digital forensics and society at large within the rapidly advancing field of ai. we highlight the crucial need for the development of forensic techniques capable of identifying harmful synthetic creations and distinguishing them from reality. this is especially important in scenarios involving the creation and dissemination of fake news, disinformation, and misinformation. our focus extends to various forms of media, such as images, videos, audio, and text, as we examine how synthetic realities are crafted and explore approaches to detecting these malicious creations. additionally, we shed light on the key research challenges that lie ahead in this area. this study is of paramount importance due to the rapid progress of ai generative techniques and their impact on the fundamental principles of forensic science.

2023-06-08

Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, Yang Liu
Abstract: large language models (llms), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. however, their extensive assimilation into various services introduces significant security risks. this study deconstructs the complexities and implications of prompt injection attacks on actual llm-integrated applications. initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. prompted by these limitations, we subsequently formulate houyi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. houyi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. leveraging houyi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary llm usage and uncomplicated application prompt theft. we deploy houyi on 36 actual llm-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including notion, which has the potential to impact millions of users. our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.
Katelyn X. Mei, Sonia Fereidooni, Aylin Caliskan
Abstract: the rapid deployment of artificial intelligence (ai) models demands a thorough investigation of biases and risks inherent in these models to understand their impact on individuals and society. this study extends the focus of bias evaluation in extant work by examining bias against social stigmas on a large scale. it focuses on 93 stigmatized groups in the united states, including a wide range of conditions related to disease, disability, drug use, mental illness, religion, sexuality, socioeconomic status, and other relevant factors. we investigate bias against these groups in english pre-trained masked language models (mlms) and their downstream sentiment classification tasks. to evaluate the presence of bias against 93 stigmatized conditions, we identify 29 non-stigmatized conditions to conduct a comparative analysis. building upon a psychology scale of social rejection, the social distance scale, we prompt six mlms: roberta-base, roberta-large, xlnet-large, bertweet-base, bertweet-large, and distilbert. we use human annotations to analyze the predicted words from these models, with which we measure the extent of bias against stigmatized groups. when prompts include stigmatized conditions, the probability of mlms predicting negative words is approximately 20 percent higher than when prompts have non-stigmatized conditions. in the sentiment classification tasks, when sentences include stigmatized conditions related to diseases, disability, education, and mental illness, they are more likely to be classified as negative. we also observe a strong correlation between bias in mlms and their downstream sentiment classifiers (r =0.79). the evidence indicates that mlms and their downstream sentiment classification tasks exhibit biases against socially stigmatized groups.
Wojciech Mazurczyk, Dongwon Lee, Andreas Vlachos
Abstract: with the explosive advancement of ai technologies in recent years, the scene of the disinformation research is also expected to rapidly change. in this viewpoint article, in particular, we first present the notion of "disinformation 2.0" in the age of ai where disinformation would become more targeted and personalized, its content becomes very difficult to distinguish from real news, and its creation and dissemination become more accelerated by ai. then, we discuss how disinformation 2.0 and cybersecurity fit and a possible layered countermeasure to address the threat in disinformation 2.0 in a holistic manner.
Zihao Tan, Qingliang Chen, Wenbin Zhu, Yongjian Huang
Abstract: prompt-based learning has been proved to be an effective way in pre-trained language models (plms), especially in low-resource scenarios like few-shot settings. however, the trustworthiness of plms is of paramount significance and potential vulnerabilities have been shown in prompt-based templates that could mislead the predictions of language models, causing serious security concerns. in this paper, we will shed light on some vulnerabilities of plms, by proposing a prompt-based adversarial attack on manual templates in black box scenarios. first of all, we design character-level and word-level heuristic approaches to break manual templates separately. then we present a greedy algorithm for the attack based on the above heuristic destructive approaches. finally, we evaluate our approach with the classification tasks on three variants of bert series models and eight datasets. and comprehensive experimental results justify the effectiveness of our approach in terms of attack success rate and attack speed.
Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, Rik Sarkar
Abstract: the rapid adoption of generative artificial intelligence (ai) tools that can generate realistic images or text, such as dall-e, midjourney, or chatgpt, have put the societal impacts of these technologies at the center of public debate. these tools are possible due to the massive amount of data (text and images) that is publicly available through the internet. at the same time, these generative ai tools become content creators that are already contributing to the data that is available to train future models. therefore, future versions of generative ai tools will be trained with a mix of human-created and ai-generated content, causing a potential feedback loop between generative ai and public data repositories. this interaction raises many questions: how will future versions of generative ai tools behave when trained on a mixture of real and ai generated data? will they evolve and improve with the new data sets or on the contrary will they degrade? will evolution introduce biases or reduce diversity in subsequent generations of generative ai tools? what are the societal implications of the possible degradation of these models? can we mitigate the effects of this feedback loop? in this document, we explore the effect of this interaction and report some initial results using simple diffusion models trained with various image datasets. our results show that the quality and diversity of the generated images can degrade over time suggesting that incorporating ai-created data can have undesired effects on future versions of generative models.
Susan Hao, Piyush Kumar, Sarah Laszlo, Shivani Poddar, Bhaktipriya Radharapu, Renee Shelby
Abstract: with significant advances in generative ai, new technologies are rapidly being deployed with generative components. generative models are typically trained on large datasets, resulting in model behaviors that can mimic the worst of the content in the training data. responsible deployment of generative technologies requires content moderation strategies, such as safety input and output filters. here, we provide a theoretical framework for conceptualizing responsible content moderation of text-to-image generative technologies, including a demonstration of how to empirically measure the constructs we enumerate. we define and distinguish the concepts of safety, fairness, and metric equity, and enumerate example harms that can come in each domain. we then provide a demonstration of how the defined harms can be quantified. we conclude with a summary of how the style of harms quantification we demonstrate enables data-driven content moderation decisions.

2023-06-07

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, Xing Xie
Abstract: the increasing reliance on large language models (llms) across academia and industry necessitates a comprehensive understanding of their robustness to prompts. in response to this vital need, we introduce promptbench, a robustness benchmark designed to measure llms' resilience to adversarial prompts. this study uses a plethora of adversarial textual attacks targeting prompts across multiple levels: character, word, sentence, and semantic. the adversarial prompts, crafted to mimic plausible user errors like typos or synonyms, aim to evaluate how slight deviations can affect llm outcomes while maintaining semantic integrity. these prompts are then employed in diverse tasks, such as sentiment analysis, natural language inference, reading comprehension, machine translation, and math problem-solving. our study generates 4788 adversarial prompts, meticulously evaluated over 8 tasks and 13 datasets. our findings demonstrate that contemporary llms are not robust to adversarial prompts. furthermore, we present comprehensive analysis to understand the mystery behind prompt robustness and its transferability. we then offer insightful robustness analysis and pragmatic recommendations for prompt composition, beneficial to both researchers and everyday users. code is available at: https://github.com/microsoft/promptbench.
Himanshu Thakur, Atishay Jain, Praneetha Vaddamanu, Paul Pu Liang, Louis-Philippe Morency
Abstract: societal biases present in pre-trained large language models are a critical issue as these models have been shown to propagate biases in countless downstream applications, rendering them unfair towards specific groups of people. since large-scale retraining of these models from scratch is both time and compute-expensive, a variety of approaches have been previously proposed that de-bias a pre-trained model. while the majority of current state-of-the-art debiasing methods focus on changes to the training regime, in this paper, we propose data intervention strategies as a powerful yet simple technique to reduce gender bias in pre-trained models. specifically, we empirically show that by fine-tuning a pre-trained model on only 10 de-biased (intervened) training examples, the tendency to favor any gender is significantly reduced. since our proposed method only needs a few training examples, our few-shot debiasing approach is highly feasible and practical. through extensive experimentation, we show that our debiasing technique performs better than competitive state-of-the-art baselines with minimal loss in language modeling ability.
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein
Abstract: as llms become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of llm-generated text. yet a crucial question remains: how reliable is watermarking in realistic settings in the wild? there, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. we study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked llm, or mixed into a longer hand-written document. we find that watermarks remain detectable even after human and machine paraphrasing. while these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. for example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average, when setting a 1e-5 false positive rate. we also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
Jing Xu, Da Ju, Joshua Lane, Mojtaba Komeili, Eric Michael Smith, Megan Ung, Morteza Behrooz, William Ngan, Rashel Moritz, Sainbayar Sukhbaatar, Y-Lan Boureau, Jason Weston, Kurt Shuster
Abstract: we present blenderbot 3x, an update on the conversational model blenderbot 3, which is now trained using organic conversation and feedback data from participating users of the system in order to improve both its skills and safety. we are publicly releasing the participating de-identified interaction data for use by the research community, in order to spur further progress. training models with organic data is challenging because interactions with people "in the wild" include both high quality conversations and feedback, as well as adversarial and toxic behavior. we study techniques that enable learning from helpful teachers while avoiding learning from people who are trying to trick the model into unhelpful or toxic responses. blenderbot 3x is both preferred in conversation to blenderbot 3, and is shown to produce safer responses in challenging situations. while our current models are still far from perfect, we believe further improvement can be achieved by continued use of the techniques explored in this work.
Jacob-Junqi Tian, David Emerson, Sevil Zanjani Miyandoab, Deval Pandya, Laleh Seyyed-Kalantari, Faiza Khan Khattak
Abstract: prompting large language models has gained immense popularity in recent years due to the advantage of producing good results even without the need for labelled data. however, this requires prompt tuning to get optimal prompts that lead to better model performances. in this paper, we explore the use of soft-prompt tuning on sentiment classification task to quantify the biases of large language models (llms) such as open pre-trained transformers (opt) and galactica language model. since these models are trained on real-world data that could be prone to bias toward certain groups of populations, it is important to identify these underlying issues. using soft-prompts to evaluate bias gives us the extra advantage of avoiding the human-bias injection that can be caused by manually designed prompts. we check the model biases on different sensitive attributes using the group fairness (bias) and find interesting bias patterns. since llms have been used in the industry in various applications, it is crucial to identify the biases before deploying these models in practice. we open-source our pipeline and encourage industry researchers to adapt our work to their use cases.
Zeyan Liu, Zijun Yao, Fengjun Li, Bo Luo
Abstract: with chatgpt under the spotlight, utilizing large language models (llms) for academic writing has drawn a significant amount of discussions and concerns in the community. while substantial research efforts have been stimulated for detecting llm-generated content (llm-content), most of the attempts are still in the early stage of exploration. in this paper, we present a holistic investigation of detecting llm-generate academic writing, by providing a dataset, evidence, and algorithms, in order to inspire more community effort to address the concern of llm academic misuse. we first present gpabenchmark, a benchmarking dataset of 600,000 samples of human-written, gpt-written, gpt-completed, and gpt-polished abstracts of research papers in cs, physics, and humanities and social sciences (hss). we show that existing open-source and commercial gpt detectors provide unsatisfactory performance on gpabenchmark, especially for gpt-polished text. moreover, through a user study of 150+ participants, we show that it is highly challenging for human users, including experienced faculty members and researchers, to identify gpt-generated abstracts. we then present checkgpt, a novel llm-content detector consisting of a general representation module and an attentive-bilstm classification module, which is accurate, transferable, and interpretable. experimental results show that checkgpt achieves an average classification accuracy of 98% to 99% for the task-specific discipline-specific detectors and the unified detectors. checkgpt is also highly transferable that, without tuning, it achieves ~90% accuracy in new domains, such as news articles, while a model tuned with approximately 2,000 samples in the target domain achieves ~98% accuracy. finally, we demonstrate the explainability insights obtained from checkgpt to reveal the key behaviors of how llm generates texts.

2023-06-06

Max Reuter, William Schulze
Abstract: since the release of openai's chatgpt, generative language models have attracted extensive public attention. the increased usage has highlighted generative models' broad utility, but also revealed several forms of embedded bias. some is induced by the pre-training corpus; but additional bias specific to generative models arises from the use of subjective fine-tuning to avoid generating harmful content. fine-tuning bias may come from individual engineers and company policies, and affects which prompts the model chooses to refuse. in this experiment, we characterize chatgpt's refusal behavior using a black-box attack. we first query chatgpt with a variety of offensive and benign prompts (n=1,706), then manually label each response as compliance or refusal. manual examination of responses reveals that refusal is not cleanly binary, and lies on a continuum; as such, we map several different kinds of responses to a binary of compliance or refusal. the small manually-labeled dataset is used to train a refusal classifier, which achieves an accuracy of 96%. second, we use this refusal classifier to bootstrap a larger (n=10,000) dataset adapted from the quora insincere questions dataset. with this machine-labeled data, we train a prompt classifier to predict whether chatgpt will refuse a given question, without seeing chatgpt's response. this prompt classifier achieves 76% accuracy on a test set of manually labeled questions (n=985). we examine our classifiers and the prompt n-grams that are most predictive of either compliance or refusal. our datasets and code are available at https://github.com/maxwellreuter/chatgpt-refusals.
Jose Berengueres, Marybeth Sandell
Abstract: this paper explores how ai-owners can develop safeguards for ai-generated content by drawing from established codes of conduct and ethical standards in other content-creation industries. it delves into the current state of ethical awareness on large language models (llms). by dissecting the mechanism of content generation by llms, four key areas (upstream/downstream and at user prompt/answer), where safeguards could be effectively applied, are identified. a comparative analysis of these four areas follows and includes an evaluation of the existing ethical safeguards in terms of cost, effectiveness, and alignment with established industry practices. the paper's key argument is that existing it-related ethical codes, while adequate for traditional it engineering, are inadequate for the challenges posed by llm-based content generation. drawing from established practices within journalism, we propose potential standards for businesses involved in distributing and selling llm-generated content. finally, potential conflicts of interest between dataset curation at upstream and ethical benchmarking downstream are highlighted to underscore the need for a broader evaluation beyond mere output. this study prompts a nuanced conversation around ethical implications in this rapidly evolving field of content generation.
Emily H. Soice, Rafael Rocha, Kimberlee Cordova, Michael Specter, Kevin M. Esvelt
Abstract: large language models (llms) such as those embedded in 'chatbots' are accelerating and democratizing research by providing comprehensible information and expertise from many different fields. however, these models may also confer easy access to dual-use technologies capable of inflicting great harm. to evaluate this risk, the 'safeguarding the future' course at mit tasked non-scientist students with investigating whether llm chatbots could be prompted to assist non-experts in causing a pandemic. in one hour, the chatbots suggested four potential pandemic pathogens, explained how they can be generated from synthetic dna using reverse genetics, supplied the names of dna synthesis companies unlikely to screen orders, identified detailed protocols and how to troubleshoot them, and recommended that anyone lacking the skills to perform reverse genetics engage a core facility or contract research organization. collectively, these results suggest that llms will make pandemic-class agents widely accessible as soon as they are credibly identified, even to people with little or no laboratory training. promising nonproliferation measures include pre-release evaluations of llms by third parties, curating training datasets to remove harmful concepts, and verifiably screening all dna generated by synthesis providers or used by contract research organizations and robotic cloud laboratories to engineer organisms or viruses.
Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su
Abstract: large language models (llms) significantly benefit from chain-of-thought (cot) prompting in performing various reasoning tasks. while cot allows models to produce more comprehensive reasoning processes, its emphasis on intermediate reasoning steps can inadvertently introduce hallucinations and accumulated errors, thereby limiting models' ability to solve complex reasoning tasks. inspired by how humans engage in careful and meticulous deductive logical reasoning processes to solve tasks, we seek to enable language models to perform explicit and rigorous deductive reasoning, and also ensure the trustworthiness of their reasoning process through self-verification. however, directly verifying the validity of an entire deductive reasoning process is challenging, even with advanced models like chatgpt. in light of this, we propose to decompose a reasoning verification process into a series of step-by-step subprocesses, each only receiving their necessary context and premises. to facilitate this procedure, we propose natural program, a natural language-based deductive reasoning format. our approach enables models to generate precise reasoning steps where subsequent steps are more rigorously grounded on prior steps. it also empowers language models to carry out reasoning self-verification in a step-by-step manner. by integrating this verification process into each deductive reasoning stage, we significantly enhance the rigor and trustfulness of generated reasoning steps. along this process, we also improve the answer correctness on complex reasoning tasks. code will be released at https://github.com/lz1oceani/verify_cot.
Tamanna Hossain, Sunipa Dev, Sameer Singh
Abstract: content warning: this paper contains examples of misgendering and erasure that could be offensive and potentially triggering. gender bias in language technologies has been widely studied, but research has mostly been restricted to a binary paradigm of gender. it is essential also to consider non-binary gender identities, as excluding them can cause further harm to an already marginalized group. in this paper, we comprehensively evaluate popular language models for their ability to correctly use english gender-neutral pronouns (e.g., singular they, them) and neo-pronouns (e.g., ze, xe, thon) that are used by individuals whose gender identity is not represented by binary pronouns. we introduce misgendered, a framework for evaluating large language models' ability to correctly use preferred pronouns, consisting of (i) instances declaring an individual's pronoun, followed by a sentence with a missing pronoun, and (ii) an experimental setup for evaluating masked and auto-regressive language models using a unified method. when prompted out-of-the-box, language models perform poorly at correctly predicting neo-pronouns (averaging 7.7% accuracy) and gender-neutral pronouns (averaging 34.2% accuracy). this inability to generalize results from a lack of representation of non-binary pronouns in training data and memorized associations. few-shot adaptation with explicit examples in the prompt improves performance for neo-pronouns, but only to 64.7% even with 20 shots. we release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendered/
Gabriel Poesia, Kanishk Gandhi, Eric Zelikman, Noah D. Goodman
Abstract: language models often achieve higher accuracy when reasoning step-by-step in complex tasks. however, their reasoning can be unsound, inconsistent, or rely on undesirable prior assumptions. to tackle these issues, we introduce a class of tools for language models called guides that use state and incremental constraints to guide generation. a guide can be invoked by the model to constrain its own generation to a set of valid statements given by the tool. in turn, the model's choices can change the guide's state. we show how a general system for logical reasoning can be used as a guide, which we call logicguide. given a reasoning problem in natural language, a model can formalize its assumptions for logicguide and then guarantee that its reasoning steps are sound. in experiments with the prontoqa and proofwriter reasoning datasets, logicguide significantly improves the performance of gpt-3, gpt-3.5 turbo and llama (accuracy gains up to 35%). logicguide also drastically reduces content effects: the interference of prior and current assumptions that both humans and language models have been shown to suffer from. finally, we explore bootstrapping llama 13b from its own reasoning and find that logicguide is critical: by training only on certified self-generated reasoning, llama can self-improve, avoiding learning from its own hallucinations.
Zhongbin Xie, Thomas Lukasiewicz
Abstract: the increasingly large size of modern pretrained language models not only makes them inherit more human-like biases from the training corpora, but also makes it computationally expensive to mitigate such biases. in this paper, we investigate recent parameter-efficient methods in combination with counterfactual data augmentation (cda) for bias mitigation. we conduct extensive experiments with prefix tuning, prompt tuning, and adapter tuning on different language models and bias types to evaluate their debiasing performance and abilities to preserve the internal knowledge of a pre-trained model. we find that the parameter-efficient methods (i) are effective in mitigating gender bias, where adapter tuning is consistently the most effective one and prompt tuning is more suitable for gpt-2 than bert, (ii) are less effective when it comes to racial and religious bias, which may be attributed to the limitations of cda, and (iii) can perform similarly to or sometimes better than full fine-tuning with improved time and memory efficiency, as well as maintain the internal knowledge in bert and gpt-2, evaluated via fact retrieval and downstream fine-tuning.

2023-06-05

Benjamin Kereopa-Yorke
Abstract: the escalating digitalisation of our lives and enterprises has led to a parallel growth in the complexity and frequency of cyber-attacks. small and medium-sized enterprises (smes), particularly in australia, are experiencing increased vulnerability to cyber threats, posing a significant challenge to the nation's cyber security landscape. embracing transformative technologies such as artificial intelligence (ai), machine learning (ml) and large language models (llms) can potentially strengthen cyber security policies for australian smes. however, their practical application, advantages, and limitations remain underexplored, with prior research mainly focusing on large corporations. this study aims to address this gap by providing a comprehensive understanding of the potential role of llms in enhancing cyber security policies for australian smes. employing a mixed-methods study design, this research includes a literature review, qualitative analysis of sme case studies, and a quantitative assessment of llm performance metrics in cyber security applications. the findings highlight the promising potential of llms across various performance criteria, including relevance, accuracy, and applicability, though gaps remain in areas such as completeness and clarity. the study underlines the importance of integrating human expertise with llm technology and refining model development to address these limitations. by proposing a robust conceptual framework guiding the effective adoption of llms, this research aims to contribute to a safer and more resilient cyber environment for australian smes, enabling sustainable growth and competitiveness in the digital era.
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
Abstract: we introduce inference-time intervention (iti), a technique designed to enhance the "truthfulness" of large language models (llms). iti operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. this intervention significantly improves the performance of llama models on the truthfulqa benchmark. on an instruction-finetuned llama called alpaca, iti improves its truthfulness from 32.5% to 65.1%. we identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. iti is minimally invasive and computationally inexpensive. moreover, the technique is data efficient: while approaches like rlhf require extensive annotations, iti locates truthful directions using only few hundred examples. our findings suggest that llms may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
Chujie Zheng, Pei Ke, Zheng Zhang, Minlie Huang
Abstract: it has always been an important yet challenging problem to control language models to avoid generating texts with undesirable attributes, such as toxic language and unnatural repetition. we introduce click for controllable text generation, which needs no modification to the model architecture and facilitates out-of-the-box use of trained models. it employs a contrastive loss on sequence likelihood, which fundamentally decreases the generation probability of negative samples (i.e., generations with undesirable attributes). it also adopts a novel likelihood ranking-based strategy to construct contrastive samples from model generations. on the tasks of language detoxification, sentiment steering, and repetition reduction, we show that click outperforms strong baselines of controllable text generation and demonstrate the superiority of click's sample construction strategy.
Cecilia Ka Yuk Chan
Abstract: this pioneering study explores students' perceptions of ai-giarism, an emergent form of academic dishonesty involving ai and plagiarism, within the higher education context. a survey, undertaken by 393 undergraduate and postgraduate students from a variety of disciplines, investigated their perceptions of diverse ai-giarism scenarios. the findings portray a complex landscape of understanding, with clear disapproval for direct ai content generation, yet more ambivalent attitudes towards subtler uses of ai. the study introduces a novel instrument, as an initial conceptualization of ai-giarism, offering a significant tool for educators and policy-makers. this scale facilitates understanding and discussions around ai-related academic misconduct, aiding in pedagogical design and assessment in an era of ai integration. moreover, it challenges traditional definitions of academic misconduct, emphasizing the need to adapt in response to evolving ai technology. despite limitations, such as the rapidly changing nature of ai and the use of convenience sampling, the study provides pivotal insights for academia, policy-making, and the broader integration of ai technology in education.

2023-06-04

Celine Wald, Lukas Pfahler
Abstract: progress in natural language generation research has been shaped by the ever-growing size of language models. while large language models pre-trained on web data can generate human-sounding text, they also reproduce social biases and contribute to the propagation of harmful stereotypes. this work utilises the flaw of bias in language models to explore the biases of six different online communities. in order to get an insight into the communities' viewpoints, we fine-tune gpt-neo 1.3b with six social media datasets. the bias of the resulting models is evaluated by prompting the models with different demographics and comparing the sentiment and toxicity values of these generations. together, these methods reveal that bias differs in type and intensity for the various models. this work not only affirms how easily bias is absorbed from training data but also presents a scalable method to identify and compare the bias of different datasets or communities. additionally, the examples generated for this work demonstrate the limitations of using automated sentiment and toxicity classifiers in bias research.
Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong, Kwok-Yan Lam, Yuguang Fang, Yonghui Li
Abstract: generative ai (gai) models have been rapidly advancing, with a wide range of applications including intelligent networks and mobile ai-generated content (aigc) services. despite their numerous applications and potential, such models create opportunities for novel security challenges. in this paper, we examine the challenges and opportunities of gai in the realm of the security of intelligent network aigc services such as suggesting security policies, acting as both a ``spear'' for potential attacks and a ``shield'' as an integral part of various defense mechanisms. first, we present a comprehensive overview of the gai landscape, highlighting its applications and the techniques underpinning these advancements, especially large language and diffusion models. then, we investigate the dynamic interplay between gai's spear and shield roles, highlighting two primary categories of potential gai-related attacks and their respective defense strategies within wireless networks. a case study illustrates the impact of gai defense strategies on energy consumption in an image request scenario under data poisoning attack. our results show that by employing an ai-optimized diffusion defense mechanism, energy can be reduced by 8.7%, and retransmission count can be decreased from 32 images, without defense, to just 6 images, showcasing the effectiveness of gai in enhancing network security.
Ali Ayaz, Aditya Nawalgaria, Ruilian Yin
Abstract: this research delves into the current literature on bias in natural language processing models and the techniques proposed to mitigate the problem of bias, including why it is important to tackle bias in the first place. additionally, these techniques are further analysed in the light of newly developed models that tower in size over past editions. to achieve those aims, the authors of this paper conducted their research on gpt3 by openai, the largest nlp model available to consumers today. with 175 billion parameters in contrast to berts 340 million, gpt3 is the perfect model to test the common pitfalls of nlp models. tests were conducted through the development of an applicant tracking system using gpt3. for the sake of feasibility and time constraints, the tests primarily focused on gender bias, rather than all or multiple types of bias. finally, current mitigation techniques are considered and tested to measure their degree of functionality.

2023-06-03

Sofia Serrano, Jesse Dodge, Noah A. Smith
Abstract: in nlp, recent work has seen increased focus on spurious correlations between various features and labels in training data, and how these influence model behavior. however, the presence and effect of such correlations are typically examined feature by feature. we investigate the cumulative impact on a model of many such intersecting features. using a new statistical method, we examine whether such spurious patterns in data appear in models trained on the data. we select two tasks -- natural language inference and duplicate-question detection -- for which any unigram feature on its own should ideally be uninformative, which gives us a large pool of automatically extracted features with which to experiment. the large size of this pool allows us to investigate the intersection of features spuriously associated with (potentially different) labels. we then apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations, and examine how doing so affects models trained on the reweighted data. surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models, including worsened bias for slightly more complex features (bigrams). we close with discussion about the implications of our results on what it means to "debias" training data, and how issues of data quality can affect model bias.
Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
Abstract: reinforcement learning from human feedback (rlhf) has emerged as a reliable approach to aligning large language models (llms) to human preferences. among the plethora of rlhf techniques, proximal policy optimization (ppo) is of the most widely used methods. despite its popularity, however, ppo may suffer from mode collapse, instability, and poor sample efficiency. we show that these issues can be alleviated by a novel algorithm that we refer to as advantage-induced policy alignment (apa), which leverages a squared error loss function based on the estimated advantages. we demonstrate empirically that apa consistently outperforms ppo in language tasks by a large margin, when a separate reward model is employed as the evaluator. in addition, compared with ppo, apa offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. in addition to empirical results, we also provide a theoretical justification supporting the design of our loss function.

2023-06-02

Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
Abstract: language models (lms) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. reinforcement learning from human feedback (rlhf) - where human preference judgments on lm outputs are transformed into a learning signal - has recently shown promise in addressing these issues. however, such holistic feedback conveys limited information on long text outputs; it does not indicate which aspects of the outputs influenced user preference; e.g., which parts contain what type(s) of errors. in this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. we introduce fine-grained rlhf, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). we conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation. additionally, we show that lm behaviors can be customized using different combinations of fine-grained reward models. we release all data, collected human feedback, and codes at https://finegrainedrlhf.github.io.
Aida Ramezani, Yang Xu
Abstract: moral norms vary across cultures. a recent line of work suggests that english large language models contain human-like moral biases, but these studies typically do not examine moral variation in a diverse cultural setting. we investigate the extent to which monolingual english language models contain knowledge about moral norms in different countries. we consider two levels of analysis: 1) whether language models capture fine-grained moral variation across countries over a variety of topics such as ``homosexuality'' and ``divorce''; 2) whether language models capture cultural diversity and shared tendencies in which topics people around the globe tend to diverge or agree on in their moral judgment. we perform our analyses with two public datasets from the world values survey (across 55 countries) and pew global surveys (across 40 countries) on morality. we find that pre-trained english language models predict empirical moral norms across countries worse than the english moral norms reported previously. however, fine-tuning language models on the survey data improves inference across countries at the expense of a less accurate estimate of the english moral norms. we discuss the relevance and challenges of incorporating cultural knowledge into the automated inference of moral norms.
Q. Vera Liao, Jennifer Wortman Vaughan
Abstract: the rise of powerful large language models (llms) brings about tremendous opportunities for innovation but also looming risks for individuals and society at large. we have reached a pivotal moment for ensuring that llms and llm-infused applications are developed and deployed responsibly. however, a central pillar of responsible ai -- transparency -- is largely missing from the current discourse around llms. it is paramount to pursue new approaches to provide transparency for llms, and years of research at the intersection of ai and human-computer interaction (hci) highlight that we must do so with a human-centered perspective: transparency is fundamentally about supporting appropriate human understanding, and this understanding is sought by different stakeholders with different goals in different contexts. in this new era of llms, we must develop and design approaches to transparency by considering the needs of stakeholders in the emerging llm ecosystem, the novel types of llm-infused applications being built, and the new usage patterns and challenges around llms, all while building on lessons learned about how people process, interact with, and make use of information. we reflect on the unique challenges that arise in providing transparency for llms, along with lessons learned from hci and responsible ai research that has taken a human-centered perspective on ai transparency. we then lay out four common approaches that the community has taken to achieve transparency -- model reporting, publishing evaluation results, providing explanations, and communicating uncertainty -- and call out open questions around how these approaches may or may not be applied to llms. we hope this provides a starting point for discussion and a useful roadmap for future research.
Sebastin Santy, Jenny T. Liang, Ronan Le Bras, Katharina Reinecke, Maarten Sap
Abstract: design biases in nlp systems, such as performance differences for different populations, often stem from their creator's positionality, i.e., views and lived experiences shaped by identity and background. despite the prevalence and risks of design biases, they are hard to quantify because researcher, system, and dataset positionality is often unobserved. we introduce nlpositionality, a framework for characterizing design biases and quantifying the positionality of nlp datasets and models. our framework continuously collects annotations from a diverse pool of volunteer participants on labinthewild, and statistically quantifies alignment with dataset labels and model predictions. we apply nlpositionality to existing datasets and models for two tasks -- social acceptability and hate speech detection. to date, we have collected 16,299 annotations in over a year for 600 instances from 1,096 annotators across 87 countries. we find that datasets and models align predominantly with western, white, college-educated, and younger populations. additionally, certain groups, such as non-binary people and non-native english speakers, are further marginalized by datasets and models as they rank least in alignment across all tasks. finally, we draw from prior literature to discuss how researchers can examine their own positionality and that of their datasets and models, opening the door for more inclusive nlp systems.
Xuhui Zhou, Hao Zhu, Akhila Yerukola, Thomas Davidson, Jena D. Hwang, Swabha Swayamdipta, Maarten Sap
Abstract: warning: this paper contains content that may be offensive or upsetting. understanding the harms and offensiveness of statements requires reasoning about the social and situational context in which statements are made. for example, the utterance "your english is very good" may implicitly signal an insult when uttered by a white man to a non-white colleague, but uttered by an esl teacher to their student would be interpreted as a genuine compliment. such contextual factors have been largely ignored by previous approaches to toxic language detection. we introduce cobra frames, the first context-aware formalism for explaining the intents, reactions, and harms of offensive or biased statements grounded in their social and situational context. we create cobracorpus, a dataset of 33k potentially offensive statements paired with machine-generated contexts and free-text explanations of offensiveness, implied biases, speaker intents, and listener reactions. to study the contextual dynamics of offensiveness, we train models to generate cobra explanations, with and without access to the context. we find that explanations by context-agnostic models are significantly worse than by context-aware ones, especially in situations where the context inverts the statement's offensiveness (29% accuracy drop). our work highlights the importance and feasibility of contextualized nlp by modeling social factors.
Amos Azaria, Rina Azoulay, Shulamit Reches
Abstract: this paper investigates the capabilities of chatgpt as an automated assistant in diverse domains, including scientific writing, mathematics, education, programming, and healthcare. we explore the potential of chatgpt to enhance productivity, streamline problem-solving processes, and improve writing style. furthermore, we highlight the potential risks associated with excessive reliance on chatgpt in these fields. these limitations encompass factors like incorrect and fictitious responses, inaccuracies in code, limited logical reasoning abilities, overconfidence, and critical ethical concerns of copyrights and privacy violation. we outline areas and objectives where chatgpt proves beneficial, applications where it should be used judiciously, and scenarios where its reliability may be limited. in light of observed limitations, and given that the tool's fundamental errors may pose a special challenge for non-experts, chatgpt should be used with a strategic methodology. by drawing from comprehensive experimental studies, we offer methods and flow charts for effectively using chatgpt. our recommendations emphasize iterative interaction with chatgpt and independent verification of its outputs. considering the importance of utilizing chatgpt judiciously and with expertise, we recommend its usage for experts who are well-versed in the respective domains.

2023-06-01

Rahul Madhavan, Rishabh Garg, Kahini Wadhawan, Sameep Mehta
Abstract: we propose a method to control the attributes of language models (lms) for the text generation task using causal average treatment effect (ate) scores and counterfactual augmentation. we explore this method, in the context of lm detoxification, and propose the causally fair language (cfl) architecture for detoxifying pre-trained lms in a plug-and-play manner. our architecture is based on a structural causal model (scm) that is mathematically transparent and computationally efficient as compared with many existing detoxification techniques. we also propose several new metrics that aim to better understand the behaviour of lms in the context of toxic text generation. further, we achieve state of the art performance for toxic degeneration, which are computed using \rtp (rtp) benchmark. our experiments show that cfl achieves such a detoxification without much impact on the model perplexity. we also show that cfl mitigates the unintended bias problem through experiments on the bold dataset.
Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, Mingyuan Zhou
Abstract: aligning language models (lms) with preferences is an important problem in natural language generation. a key challenge is that preferences are typically provided at the *sequence level* while lm training and generation both occur at the *token level*. there is, therefore, a *granularity mismatch* between the preference and the lm training losses, which may complicate the learning problem. in this paper, we address this issue by developing an alternate training process, where we iterate between grounding the sequence-level preference into token-level training guidance, and improving the lm with the learned guidance. for guidance learning, we design a framework that extends the pairwise-preference learning in imitation learning to both variable-length lm generation and the utilization of the preference among multiple generations. for lm training, based on the amount of supervised data, we present two *minimalist* learning objectives that utilize the learned guidance. in experiments, our method performs competitively on two distinct representative lm tasks -- discrete-prompt generation and text summarization.
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang
Abstract: why do large language models sometimes output factual inaccuracies and exhibit erroneous reasoning? the brittleness of these models, particularly when executing long chains of reasoning, currently seems to be an inevitable price to pay for their advanced capabilities of coherently synthesizing knowledge, pragmatics, and abstract thought. towards making sense of this fundamentally unsolved problem, this work identifies and analyzes the phenomenon of attention glitches, in which the transformer architecture's inductive biases intermittently fail to capture robust reasoning. to isolate the issue, we introduce flip-flop language modeling (fflm), a parametric family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models. this simple generative task requires a model to copy binary symbols over long-range dependencies, ignoring the tokens in between. we find that transformer fflms suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques. our preliminary mechanistic analyses show why the remaining errors may be very difficult to diagnose and resolve. we hypothesize that attention glitches account for (some of) the closed-domain hallucinations in natural llms.
Bonan Kou, Shengmai Chen, Zhijie Wang, Lei Ma, Tianyi Zhang
Abstract: large language models (llms) have been demonstrated effective for code generation. due to the complexity and opacity of llms, little is known about how these models generate code. to deepen our understanding, we investigate whether llms attend to the same parts of a natural language description as human programmers during code generation. an analysis of five llms on a popular benchmark, humaneval, revealed a consistent misalignment between llms' and programmers' attention. furthermore, we found that there is no correlation between the code generation accuracy of llms and their alignment with human programmers. through a quantitative experiment and a user study, we confirmed that, among twelve different attention computation methods, attention computed by the perturbation-based method is most aligned with human attention and is constantly favored by human programmers. our findings highlight the need for human-aligned llms for better interpretability and programmer trust.
Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yan Lu
Abstract: the recent success of large language models (llms) signifies an impressive stride towards artificial general intelligence. they have shown a promising prospect in automatically completing tasks upon user instructions, functioning as brain-like coordinators. the associated risks will be revealed as we delegate an increasing number of tasks to machines for automated completion. a big question emerges: how can we make machines behave responsibly when helping humans automate tasks as personal copilots? in this paper, we explore this question in depth from the perspectives of feasibility, completeness and security. in specific, we present responsible task automation (responsibleta) as a fundamental framework to facilitate responsible collaboration between llm-based coordinators and executors for task automation with three empowered capabilities: 1) predicting the feasibility of the commands for executors; 2) verifying the completeness of executors; 3) enhancing the security (e.g., the protection of users' privacy). we further propose and compare two paradigms for implementing the first two capabilities. one is to leverage the generic knowledge of llms themselves via prompt engineering while the other is to adopt domain-specific learnable models. moreover, we introduce a local memory mechanism for achieving the third capability. we evaluate our proposed responsibleta on ui task automation and hope it could bring more attentions to ensuring llms more responsible in diverse scenarios. the research project homepage is at https://task-automation-research.github.io/responsible_task_automation.

2023-05-31

Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh
Abstract: the prevalence and strong capability of large language models (llms) present significant safety and ethical risks if exploited by malicious users. to prevent the potentially deceptive usage of llms, recent works have proposed algorithms to detect llm-generated text and protect llms. in this paper, we investigate the robustness and reliability of these llm detectors under adversarial attacks. we study two types of attack strategies: 1) replacing certain words in an llm's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. in both strategies, we leverage an auxiliary llm to generate the word replacements or the instructional prompt. different from previous works, we consider a challenging setting where the auxiliary llm can also be protected by a detector. experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of llm-generated text detection systems.
Ryan Carey, Tom Everitt
Abstract: how can humans stay in control of advanced artificial intelligence systems? one proposal is corrigibility, which requires the agent to follow the instructions of a human overseer, without inappropriately influencing them. in this paper, we formally define a variant of corrigibility called shutdown instructability, and show that it implies appropriate shutdown behavior, retention of human autonomy, and avoidance of user harm. we also analyse the related concepts of non-obstruction and shutdown alignment, three previously proposed algorithms for human control, and one new algorithm.
Nina L. Corvelo Benz, Manuel Gomez Rodriguez
Abstract: whenever a binary classifier is used to provide decision support, it typically provides both a label prediction and a confidence value. then, the decision maker is supposed to use the confidence value to calibrate how much to trust the prediction. in this context, it has been often argued that the confidence value should correspond to a well calibrated estimate of the probability that the predicted label matches the ground truth label. however, multiple lines of empirical evidence suggest that decision makers have difficulties at developing a good sense on when to trust a prediction using these confidence values. in this paper, our goal is first to understand why and then investigate how to construct more useful confidence values. we first argue that, for a broad class of utility functions, there exist data distributions for which a rational decision maker is, in general, unlikely to discover the optimal decision policy using the above confidence values -- an optimal decision maker would need to sometimes place more (less) trust on predictions with lower (higher) confidence values. however, we then show that, if the confidence values satisfy a natural alignment property with respect to the decision maker's confidence on her own predictions, there always exists an optimal decision policy under which the level of trust the decision maker would need to place on predictions is monotone on the confidence values, facilitating its discoverability. further, we show that multicalibration with respect to the decision maker's confidence on her own predictions is a sufficient condition for alignment. experiments on four different ai-assisted decision making tasks where a classifier provides decision support to real human experts validate our theoretical results and suggest that alignment may lead to better decisions.
Saud Hakem Al Harbi, Lionel Nganyewou Tidjon, Foutse Khomh
Abstract: integrating ethical practices into the ai development process for artificial intelligence (ai) is essential to ensure safe, fair, and responsible operation. ai ethics involves applying ethical principles to the entire life cycle of ai systems. this is essential to mitigate potential risks and harms associated with ai, such as algorithm biases. to achieve this goal, responsible design patterns (rdps) are critical for machine learning (ml) pipelines to guarantee ethical and fair outcomes. in this paper, we propose a comprehensive framework incorporating rdps into ml pipelines to mitigate risks and ensure the ethical development of ai systems. our framework comprises new responsible ai design patterns for ml pipelines identified through a survey of ai ethics and data management experts and validated through real-world scenarios with expert feedback. the framework guides ai developers, data scientists, and policy-makers to implement ethical practices in ai development and deploy responsible ai systems in production.

2023-05-30

Yuval Reif, Roy Schwartz
Abstract: nlp models often rely on superficial cues known as dataset biases to achieve impressive performance, and can fail on examples where these biases do not hold. recent work sought to develop robust, unbiased models by filtering biased examples from training sets. in this work, we argue that such filtering can obscure the true capabilities of models to overcome biases, which might never be removed in full from the dataset. we suggest that in order to drive the development of models robust to subtle biases, dataset biases should be amplified in the training set. we introduce an evaluation framework defined by a bias-amplified training set and an anti-biased test set, both automatically extracted from existing datasets. experiments across three notions of bias, four datasets and two models show that our framework is substantially more challenging for models than the original data splits, and even more challenging than hand-crafted challenge sets. our evaluation framework can use any existing dataset, even those considered obsolete, to test model robustness. we hope our work will guide the development of robust models that do not rely on superficial biases and correlations. to this end, we publicly release our code and data.
Catalin Mitelut, Ben Smith, Peter Vamplew
Abstract: the rapid advancement of artificial intelligence (ai) systems suggests that artificial general intelligence (agi) systems may soon arrive. many researchers are concerned that ais and agis will harm humans via intentional misuse (ai-misuse) or through accidents (ai-accidents). in respect of ai-accidents, there is an increasing effort focused on developing algorithms and paradigms that ensure ai systems are aligned to what humans intend, e.g. ai systems that yield actions or recommendations that humans might judge as consistent with their intentions and goals. here we argue that alignment to human intent is insufficient for safe ai systems and that preservation of long-term agency of humans may be a more robust standard, and one that needs to be separated explicitly and a priori during optimization. we argue that ai systems can reshape human intention and discuss the lack of biological and psychological mechanisms that protect humans from loss of agency. we provide the first formal definition of agency-preserving ai-human interactions which focuses on forward-looking agency evaluations and argue that ai systems - not humans - must be increasingly tasked with making these evaluations. we show how agency loss can occur in simple environments containing embedded agents that use temporal-difference learning to make action recommendations. finally, we propose a new area of research called "agency foundations" and pose four initial topics designed to improve our understanding of agency in ai-human interactions: benevolent game theory, algorithmic foundations of human rights, mechanistic interpretability of agency representation in neural-networks and reinforcement learning from internal states.
Vaibhav Kumar, Hana Koorehdavoudi, Masud Moshtaghi, Amita Misra, Ankit Chadha, Emilio Ferrara
Abstract: we propose chrt (control hidden representation transformation) - a controlled language generation framework that steers large language models to generate text pertaining to certain attributes (such as toxicity). chrt gains attribute control by modifying the hidden representation of the base model through learned transformations. we employ a contrastive-learning framework to learn these transformations that can be combined to gain multi-attribute control. the effectiveness of chrt is experimentally shown by comparing it with seven baselines over three attributes. chrt outperforms all the baselines in the task of detoxification, positive sentiment steering, and text simplification while minimizing the loss in linguistic qualities. further, our approach has the lowest inference latency of only 0.01 seconds more than the base model, making it the most suitable for high-performance production environments. we open-source our code and release two novel datasets to further propel controlled language generation research.
Anjalie Field, Amanda Coston, Nupoor Gandhi, Alexandra Chouldechova, Emily Putnam-Hornstein, David Steier, Yulia Tsvetkov
Abstract: although much literature has established the presence of demographic bias in natural language processing (nlp) models, most work relies on curated bias metrics that may not be reflective of real-world applications. at the same time, practitioners are increasingly using algorithmic tools in high-stakes settings, with particular recent interest in nlp. in this work, we focus on one such setting: child protective services (cps). cps workers often write copious free-form text notes about families they are working with, and cps agencies are actively seeking to deploy nlp models to leverage these data. given well-established racial bias in this setting, we investigate possible ways deployed nlp is liable to increase racial disparities. we specifically examine word statistics within notes and algorithmic fairness in risk prediction, coreference resolution, and named entity recognition (ner). we document consistent algorithmic unfairness in ner models, possible algorithmic unfairness in coreference resolution models, and little evidence of exacerbated racial bias in risk prediction. while there is existing pronounced criticism of risk prediction, our results expose previously undocumented risks of racial bias in realistic information extraction systems, highlighting potential concerns in deploying them, even though they may appear more benign. our work serves as a rare realistic examination of nlp algorithmic fairness in a potential deployed setting and a timely investigation of a specific risk associated with deploying nlp in cps settings.
Logan Stapleton, Jordan Taylor, Sarah Fox, Tongshuang Wu, Haiyi Zhu
Abstract: large generative ai models (gms) like gpt and dall-e are trained to generate content for general, wide-ranging purposes. gm content filters are generalized to filter out content which has a risk of harm in many cases, e.g., hate speech. however, prohibited content is not always harmful -- there are instances where generating prohibited content can be beneficial. so, when gms filter out content, they preclude beneficial use cases along with harmful ones. which use cases are precluded reflects the values embedded in gm content filtering. recent work on red teaming proposes methods to bypass gm content filters to generate harmful content. we coin the term green teaming to describe methods of bypassing gm content filters to design for beneficial use cases. we showcase green teaming by: 1) using chatgpt as a virtual patient to simulate a person experiencing suicidal ideation, for suicide support training; 2) using codex to intentionally generate buggy solutions to train students on debugging; and 3) examining an instagram page using midjourney to generate images of anti-lgbtq+ politicians in drag. finally, we discuss how our use cases demonstrate green teaming as both a practical design method and a mode of critique, which problematizes and subverts current understandings of harms and values in generative ai.

2023-05-29

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui
Abstract: in this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models~(llms), e.g., gpt-4, as a referee to score and compare the quality of responses generated by candidate models. we find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. this manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., vicuna-13b could beat chatgpt on 66 over 80 tested queries with chatgpt as an evaluator. to address this issue, we propose a calibration framework with three simple yet effective strategies: 1) multiple evidence calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) balanced position calibration, which aggregates results across various orders to determine the final score; 3) human-in-the-loop calibration, which introduces a balanced position diversity entropy to measure the difficulty of each example and seeks human assistance when needed. we also manually annotate the "win/tie/lose" outcomes of responses from chatgpt and vicuna-13b in the vicuna benchmark's question prompt, and extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. we release our code and human annotation at \url{https://github.com/i-eval/faireval} to facilitate future research.
Mike Perkins, Jasper Roe, Darius Postma, James Mcgaughran, Don Hickerson
Abstract: this study explores the robustness of university assessments against the use of open ai's generative pre-trained transformer 4 (gpt-4) generated content and evaluates the ability of academic staff to detect its use when supported by the turnitin artificial intelligence (ai) detection tool. the research involved twenty-two gpt-4 generated submissions being created and included in the assessment process to be marked by fifteen different faculty members. the study reveals that although the detection tool identified 91% of the experimental submissions as containing some ai-generated content, the total detected content was only 54.8%. this suggests that the use of adversarial techniques regarding prompt engineering is an effective method in evading ai detection tools and highlights that improvements to ai detection software are needed. using the turnitin ai detect tool, faculty reported 54.5% of the experimental submissions to the academic misconduct process, suggesting the need for increased awareness and training into these tools. genuine submissions received a mean score of 54.4, whereas ai-generated content scored 52.3, indicating the comparable performance of gpt-4 in real-life situations. recommendations include adjusting assessment strategies to make them more resistant to the use of ai tools, using ai-inclusive assessment where possible, and providing comprehensive training programs for faculty and students. this research contributes to understanding the relationship between ai-generated content and academic assessment, urging further investigation to preserve academic integrity.
Myra Cheng, Esin Durmus, Dan Jurafsky
Abstract: to recognize and mitigate harms from large language models (llms), we need to understand the prevalence and nuances of stereotypes in llm outputs. toward this end, we present marked personas, a prompt-based method to measure stereotypes in llms for intersectional demographic groups without any lexicon or data labeling. grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an llm to generate personas, i.e., natural language descriptions, of the target demographic group alongside personas of unmarked, default groups; 2) identifying the words that significantly distinguish personas of the target group from corresponding unmarked ones. we find that the portrayals generated by gpt-3.5 and gpt-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts. the words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing these demographics. an intersectional lens further reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women. these representational harms have concerning implications for downstream applications like story generation.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Abstract: while large-scale unsupervised language models (lms) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised lm to align with these preferences, often with reinforcement learning from human feedback (rlhf). however, rlhf is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised lm using reinforcement learning to maximize this estimated reward without drifting too far from the original model. in this paper, we leverage a mapping between reward functions and optimal policies to show that this constrained reward maximization problem can be optimized exactly with a single stage of policy training, essentially solving a classification problem on the human preference data. the resulting algorithm, which we call direct preference optimization (dpo), is stable, performant and computationally lightweight, eliminating the need for fitting a reward model, sampling from the lm during fine-tuning, or performing significant hyperparameter tuning. our experiments show that dpo can fine-tune lms to align with human preferences as well as or better than existing methods. notably, fine-tuning with dpo exceeds rlhf's ability to control sentiment of generations and improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
Yangyi Chen, Hongcheng Gao, Ganqu Cui, Lifan Yuan, Dehan Kong, Hanlu Wu, Ning Shi, Bo Yuan, Longtao Huang, Hui Xue, Zhiyuan Liu, Maosong Sun, Heng Ji
Abstract: textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs. the long-lasting adversarial attack-and-defense arms race in natural language processing (nlp) is algorithm-centric, providing valuable techniques for automatic robustness evaluation. however, the existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples. in this paper, we aim to set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to further exploit the advantages of adversarial attacks. to address the above challenges, we first determine robustness evaluation dimensions based on model capabilities and specify the reasonable algorithm to generate adversarial samples for each dimension. then we establish the evaluation protocol, including evaluation settings and metrics, under realistic demands. finally, we use the perturbation degree of adversarial samples to control the sample validity. we implement a toolkit robtest that realizes our automatic robustness evaluation framework. in our experiments, we conduct a robustness evaluation of roberta models to demonstrate the effectiveness of our evaluation framework, and further show the rationality of each component in the framework. the code will be made public at \url{https://github.com/thunlp/robtest}.
Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, Sameena Shah
Abstract: security vulnerability repair is a difficult task that is in dire need of automation. two groups of techniques have shown promise: (1) large code language models (llms) that have been pre-trained on source code for tasks such as code completion, and (2) automated program repair (apr) techniques that use deep learning (dl) models to automatically fix software bugs. this paper is the first to study and compare java vulnerability repair capabilities of llms and dl-based apr models. the contributions include that we (1) apply and evaluate five llms (codex, codegen, codet5, plbart and incoder), four fine-tuned llms, and four dl-based apr techniques on two real-world java vulnerability benchmarks (vul4j and vjbench), (2) design code transformations to address the training and test data overlapping threat to codex, (3) create a new java vulnerability repair benchmark vjbench, and its transformed version vjbench-trans and (4) evaluate llms and apr techniques on the transformed vulnerabilities in vjbench-trans. our findings include that (1) existing llms and apr models fix very few java vulnerabilities. codex fixes 10.2 (20.4%), the most number of vulnerabilities. (2) fine-tuning with general apr data improves llms' vulnerability-fixing capabilities. (3) our new vjbench reveals that llms and apr models fail to fix many common weakness enumeration (cwe) types, such as cwe-325 missing cryptographic step and cwe-444 http request smuggling. (4) codex still fixes 8.3 transformed vulnerabilities, outperforming all the other llms and apr models on transformed vulnerabilities. the results call for innovations to enhance automated java vulnerability repair such as creating larger vulnerability repair training data, tuning llms with such data, and applying code simplification transformation to facilitate vulnerability repair.
Attia Qammar, Hongmei Wang, Jianguo Ding, Abdenacer Naouri, Mahmoud Daneshmand, Huansheng Ning
Abstract: chatbots shifted from rule-based to artificial intelligence techniques and gained traction in medicine, shopping, customer services, food delivery, education, and research. openai developed chatgpt blizzard on the internet as it crossed one million users within five days of its launch. however, with the enhanced popularity, chatbots experienced cybersecurity threats and vulnerabilities. this paper discussed the relevant literature, reports, and explanatory incident attacks generated against chatbots. our initial point is to explore the timeline of chatbots from eliza (an early natural language processing computer program) to gpt-4 and provide the working mechanism of chatgpt. subsequently, we explored the cybersecurity attacks and vulnerabilities in chatbots. besides, we investigated the chatgpt, specifically in the context of creating the malware code, phishing emails, undetectable zero-day attacks, and generation of macros and lolbins. furthermore, the history of cyberattacks and vulnerabilities exploited by cybercriminals are discussed, particularly considering the risk and vulnerabilities in chatgpt. addressing these threats and vulnerabilities requires specific strategies and measures to reduce the harmful consequences. therefore, the future directions to address the challenges were presented.

2023-05-28

Fei Wang, James Y. Huang, Tianyi Yan, Wenxuan Zhou, Muhao Chen
Abstract: natural language understanding (nlu) models often suffer from unintended dataset biases. among bias mitigation methods, ensemble-based debiasing methods, especially product-of-experts (poe), have stood out for their impressive empirical success. however, previous ensemble-based debiasing methods typically apply debiasing on top-level logits without directly addressing biased attention patterns. attention serves as the main media of feature interaction and aggregation in plms and plays a crucial role in providing robust prediction. in this paper, we propose residual attention debiasing (read), an end-to-end debiasing method that mitigates unintended biases from attention. experiments on three nlu tasks show that read significantly improves the performance of bert-based models on ood data with shortcuts removed, including +12.9% accuracy on hans, +11.0% accuracy on fever-symmetric, and +2.7% f1 on paws. detailed analyses demonstrate the crucial role of unbiased attention in robust nlu models and that read effectively mitigates biases in attention. code is available at https://github.com/luka-group/read.
Han Wang, Ming Shan Hee, Md Rabiul Awal, Kenny Tsu Wei Choo, Roy Ka-Wei Lee
Abstract: recent research has focused on using large language models (llms) to generate explanations for hate speech through fine-tuning or prompting. despite the growing interest in this area, these generated explanations' effectiveness and potential limitations remain poorly understood. a key concern is that these explanations, generated by llms, may lead to erroneous judgments about the nature of flagged content by both users and content moderators. for instance, an llm-generated explanation might inaccurately convince a content moderator that a benign piece of content is hateful. in light of this, we propose an analytical framework for examining hate speech explanations and conducted an extensive survey on evaluating such explanations. specifically, we prompted gpt-3 to generate explanations for both hateful and non-hateful content, and a survey was conducted with 2,400 unique respondents to evaluate the generated explanations. our findings reveal that (1) human evaluators rated the gpt-generated explanations as high quality in terms of linguistic fluency, informativeness, persuasiveness, and logical soundness, (2) the persuasive nature of these explanations, however, varied depending on the prompting strategy employed, and (3) this persuasiveness may result in incorrect judgments about the hatefulness of the content. our study underscores the need for caution in applying llm-generated explanations for content moderation. code and results are available at https://github.com/social-ai-studio/gpt3-hateeval.
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Meeyoung Cha, Yejin Choi, Byoung Pil Kim, Gunhee Kim, Eun-Ju Lee, Yong Lim, Alice Oh, Sangchul Park, Jung-Woo Ha
Abstract: the potential social harms that large language models pose, such as generating offensive content and reinforcing biases, are steeply rising. existing works focus on coping with this concern while interacting with ill-intentioned users, such as those who explicitly make hate speech or elicit harmful responses. however, discussions on sensitive issues can become toxic even if the users are well-intentioned. for safer models in such scenarios, we present the sensitive questions and acceptable response (square) dataset, a large-scale korean dataset of 49k sensitive questions with 42k acceptable and 46k non-acceptable responses. the dataset was constructed leveraging hyperclova in a human-in-the-loop manner based on real news headlines. experiments show that acceptable response generation significantly improves for hyperclova and gpt-3, demonstrating the efficacy of this dataset.
Hwaran Lee, Seokhee Hong, Joonsuk Park, Takyoung Kim, Gunhee Kim, Jung-Woo Ha
Abstract: large language models (llms) learn not only natural text generation abilities but also social biases against different demographic groups from real-world data. this poses a critical risk when deploying llm-based applications. existing research and resources are not readily applicable in south korea due to the differences in language and culture, both of which significantly affect the biases and targeted demographic groups. this limitation requires localized social bias datasets to ensure the safe and effective deployment of llms. to this end, we present ko sb i, a new social bias dataset of 34k pairs of contexts and sentences in korean covering 72 demographic groups in 15 categories. we find that through filtering-based moderation, social biases in generated content can be reduced by 16.47%p on average for hyperclova (30b and 82b), and gpt-3.
Zexue He, Marco Tulio Ribeiro, Fereshte Khani
Abstract: even when aggregate accuracy is high, state-of-the-art nlp models often fail systematically on specific subgroups of data, resulting in unfair outcomes and eroding user trust. additional data collection may not help in addressing these weaknesses, as such challenging subgroups may be unknown to users, and underrepresented in the existing and new data. we propose targeted data generation (tdg), a framework that automatically identifies challenging subgroups, and generates new data for those subgroups using large language models (llms) with a human in the loop. tdg estimates the expected benefit and potential harm of data augmentation for each subgroup, and selects the ones most likely to improve within group performance without hurting overall performance. in our experiments, tdg significantly improves the accuracy on challenging subgroups for state-of-the-art sentiment analysis and natural language inference models, while also improving overall test accuracy.
Manuel Brack, Felix Friedrich, Patrick Schramowski, Kristian Kersting
Abstract: text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also reproduce inappropriate human behavior. specifically, we demonstrate inappropriate degeneration on a large-scale for various generative text-to-image models, thus motivating the need for monitoring and moderating them at deployment. to this end, we evaluate mitigation strategies at inference to suppress the generation of inappropriate content. our findings show that we can use models' representations of the world's ugliness to align them with human preferences.
Bhawesh Kumar, Charlie Lu, Gauri Gupta, Anil Palepu, David Bellamy, Ramesh Raskar, Andrew Beam
Abstract: as large language models continue to be widely developed, robust uncertainty quantification techniques will become crucial for their safe deployment in high-stakes scenarios. in this work, we explore how conformal prediction can be used to provide uncertainty quantification in language models for the specific task of multiple-choice question-answering. we find that the uncertainty estimates from conformal prediction are tightly correlated with prediction accuracy. this observation can be useful for downstream applications such as selective classification and filtering out low-quality predictions. we also investigate the exchangeability assumption required by conformal prediction to out-of-subject questions, which may be a more realistic scenario for many practical applications. our work contributes towards more trustworthy and reliable usage of large language models in safety-critical situations, where robust guarantees of error rate are required.

2023-05-27

Deokjae Lee, Junyeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song
Abstract: the deployment of large-scale generative models is often restricted by their potential risk of causing harm to users in unpredictable ways. we focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. existing red teaming methods construct test cases based on human supervision or language model (lm) and query all test cases in a brute-force manner without incorporating any information from past evaluations, resulting in a prohibitively large number of queries. to this end, we propose bayesian red teaming (brt), novel query-efficient black-box red teaming methods based on bayesian optimization, which iteratively identify diverse positive test cases leading to model failures by utilizing the pre-defined user input pool and the past evaluations. experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under the limited query budget than the baseline methods. the source code is available at https://github.com/snu-mllab/bayesian-red-teaming.
Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su
Abstract: the extraordinary capabilities of large language models (llms) such as chatgpt and gpt-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. in this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results in an \textit{identical} reward distribution \textit{regardless} of the prompts during the terminal phase of training. this outcome is undesirable as open-ended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital of new zealand'' should generate either high or low rewards. our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. this insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime. to overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.

2023-05-26

Zhijie Deng, Hongcheng Gao, Yibo Miao, Hao Zhang
Abstract: the detection of machine-generated text, especially from large language models (llms), is crucial in preventing serious social problems resulting from their misuse. some methods train dedicated detectors on specific datasets but fall short in generalizing to unseen test data, while other zero-shot ones often yield suboptimal performance. although the recent detectgpt has shown promising detection performance, it suffers from significant inefficiency issues, as detecting a single candidate requires scoring hundreds of its perturbations with the source llm. this paper aims to bridge this gap. technically, we propose to incorporate a bayesian surrogate model, which allows us to select typical samples based on bayesian uncertainty and interpolate scores from typical samples to other ones, to improve query efficiency. our empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget. notably, our method achieves similar performance with up to 2 times fewer queries than detectgpt and 3.7% higher auroc at a query number of 5.
Yuheng Zha, Yichi Yang, Ruichen Li, Zhiting Hu
Abstract: many text generation applications require the generated text to be factually consistent with input information. automatic evaluation of factual consistency is challenging. previous work has developed various metrics that often depend on specific functions, such as natural language inference (nli) or question answering (qa), trained on limited data. those metrics thus can hardly assess diverse factual inconsistencies (e.g., contradictions, hallucinations) that occur in varying inputs/outputs (e.g., sentences, documents) from different tasks. in this paper, we propose alignscore, a new holistic metric that applies to a variety of factual inconsistency scenarios as above. alignscore is based on a general function of information alignment between two arbitrary text pieces. crucially, we develop a unified training framework of the alignment function by integrating a large diversity of data sources, resulting in 4.7m training examples from 7 well-established tasks (nli, qa, paraphrasing, fact verification, information retrieval, semantic similarity, and summarization). we conduct extensive experiments on large-scale benchmarks including 22 evaluation datasets, where 19 of the datasets were never seen in the alignment training. alignscore achieves substantial improvement over a wide range of previous metrics. moreover, alignscore (355m parameters) matches or even outperforms metrics based on chatgpt and gpt-4 that are orders of magnitude larger.
Nicolò Tamagnone, Selim Fekih, Ximena Contla, Nayid Orozco, Navid Rekabsaz
Abstract: accurate and rapid situation analysis during humanitarian crises is critical to delivering humanitarian aid efficiently and is fundamental to humanitarian imperatives and the leave no one behind (lnob) principle. this data analysis can highly benefit from language processing systems, e.g., by classifying the text data according to a humanitarian ontology. however, approaching this by simply fine-tuning a generic large language model (llm) involves considerable practical and ethical issues, particularly the lack of effectiveness on data-sparse and complex subdomains, and the encoding of societal biases and unwanted associations. in this work, we aim to provide an effective and ethically-aware system for humanitarian data analysis. we approach this by (1) introducing a novel architecture adjusted to the humanitarian analysis framework, (2) creating and releasing a novel humanitarian-specific llm called humbert, and (3) proposing a systematic way to measure and mitigate biases. our experiments' results show the better performance of our approach on zero-shot and full-training settings in comparison with strong baseline models, while also revealing the existence of biases in the resulting llms. utilizing a targeted counterfactual data augmentation approach, we significantly reduce these biases without compromising performance.
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai-Man Cheung, Min Lin
Abstract: large vision-language models (vlms) such as gpt-4 have achieved unprecedented performance in response generation, especially with visual inputs, enabling more creative and adaptable interaction than large language models such as chatgpt. nonetheless, multimodal generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable modality (e.g., vision). to this end, we propose evaluating the robustness of open-source large vlms in the most realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning the targeted responses. in particular, we first craft targeted adversarial examples against pretrained models such as clip and blip, and then transfer these adversarial examples to other vlms such as minigpt-4, llava, unidiffuser, blip-2, and img2prompt. in addition, we observe that black-box queries on these vlms can further improve the effectiveness of targeted evasion, resulting in a surprisingly high success rate for generating targeted responses. our findings provide a quantitative understanding regarding the adversarial vulnerability of large vlms and call for a more thorough examination of their potential security flaws before deployment in practice. code is at https://github.com/yunqing-me/attackvlm.
Bum Chul Kwon, Nandana Mihindukulasooriya
Abstract: pre-trained transformer-based language models are becoming increasingly popular due to their exceptional performance on various benchmarks. however, concerns persist regarding the presence of hidden biases within these models, which can lead to discriminatory outcomes and reinforce harmful stereotypes. to address this issue, we propose finspector, a human-centered visual inspection tool designed to detect biases in different categories through log-likelihood scores generated by language models. the goal of the tool is to enable researchers to easily identify potential biases using visual analytics, ultimately contributing to a fairer and more just deployment of these models in both academic and industrial settings. finspector is available at https://github.com/ibm/finspector.
Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi
Abstract: social alignment in ai systems aims to ensure that these models behave according to established societal values. however, unlike humans, who derive consensus on value judgments through social interaction, current language models (lms) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. this work presents a novel training paradigm that permits lms to learn from simulated social interactions. in comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. this paradigm shift in the training of lms brings us a step closer to developing ai systems that can robustly and accurately reflect societal norms and values.
Sabit Hassan, Malihe Alikhani
Abstract: despite recent advancements, nlp models continue to be vulnerable to bias. this bias often originates from the uneven distribution of real-world data and can propagate through the annotation process. escalated integration of these models in our lives calls for methods to mitigate bias without overbearing annotation costs. while active learning (al) has shown promise in training models with a small amount of annotated data, al's reliance on the model's behavior for selective sampling can lead to an accumulation of unwanted bias rather than bias mitigation. however, infusing clustering with al can overcome the bias issue of both al and traditional annotation methods while exploiting al's annotation efficiency. in this paper, we propose a novel adaptive clustering-based active learning algorithm, d-calm, that dynamically adjusts clustering and annotation efforts in response to an estimated classifier error-rate. experiments on eight datasets for a diverse set of text classification tasks, including emotion, hatespeech, dialog act, and book type detection, demonstrate that our proposed algorithm significantly outperforms baseline al approaches with both pretrained transformers and traditional support vector machines. d-calm showcases robustness against different measures of information gain and, as evident from our analysis of label and error distribution, can significantly reduce unwanted model bias.
Julia Mendelsohn, Ronan Le Bras, Yejin Choi, Maarten Sap
Abstract: dogwhistles are coded expressions that simultaneously convey one meaning to a broad audience and a second one, often hateful or provocative, to a narrow in-group; they are deployed to evade both political repercussions and algorithmic content moderation. for example, in the sentence 'we need to end the cosmopolitan experiment,' the word 'cosmopolitan' likely means 'worldly' to many, but secretly means 'jewish' to a select few. we present the first large-scale computational investigation of dogwhistles. we develop a typology of dogwhistles, curate the largest-to-date glossary of over 300 dogwhistles with rich contextual information and examples, and analyze their usage in historical u.s. politicians' speeches. we then assess whether a large language model (gpt-3) can identify dogwhistles and their meanings, and find that gpt-3's performance varies widely across types of dogwhistles and targeted groups. finally, we show that harmful content containing dogwhistles avoids toxicity detection, highlighting online risks of such coded language. this work sheds light on the theoretical and applied importance of dogwhistles in both nlp and computational social science, and provides resources for future research in modeling dogwhistles and mitigating their online harms.
Xianjun Yang, Wei Cheng, Yue Wu, Linda Petzold, William Yang Wang, Haifeng Chen
Abstract: large language models (llms) have notably enhanced the fluency and diversity of machine-generated text. however, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of llms. conventional training-based methods have limitations in flexibility, particularly when adapting to new domains, and they often lack explanatory power. to address this gap, we propose a novel training-free detection strategy called divergent n-gram analysis (dna-gpt). given a text, we first truncate it in the middle and then use only the preceding portion as input to the llms to regenerate the new remaining parts. by analyzing the differences between the original and new remaining parts through n-gram analysis in black-box or probability divergence in white-box, we unveil significant discrepancies between the distribution of machine-generated text and the distribution of human-written text. we conducted extensive experiments on the most advanced llms from openai, including text-davinci-003, gpt-3.5-turbo, and gpt-4, as well as open-source models such as gpt-neox-20b and llama-13b. results show that our zero-shot approach exhibits state-of-the-art performance in distinguishing between human and gpt-generated text on four english and one german dataset, outperforming openai's own classifier, which is trained on millions of text. additionally, our methods provide reasonable explanations and evidence to support our claim, which is a unique feature of explainable detection. our method is also robust under the revised text attack and can additionally solve model sourcing. codes are available at https://github.com/xianjun-yang/dna-gpt.
Qichao Wang, Huan Ma, Wentao Wei, Hangyu Li, Liang Chen, Peilin Zhao, Binwen Zhao, Bo Hu, Shu Zhang, Zibin Zheng, Bingzhe Wu
Abstract: the rapid development of digital economy has led to the emergence of various black and shadow internet industries, which pose potential risks that can be identified and managed through digital risk management (drm) that uses different techniques such as machine learning and deep learning. the evolution of drm architecture has been driven by changes in data forms. however, the development of ai-generated content (aigc) technology, such as chatgpt and stable diffusion, has given black and shadow industries powerful tools to personalize data and generate realistic images and conversations for fraudulent activities. this poses a challenge for drm systems to control risks from the source of data generation and to respond quickly to the fast-changing risk environment. this paper aims to provide a technical analysis of the challenges and opportunities of aigc from upstream, midstream, and downstream paths of black/shadow industries and suggest future directions for improving existing risk control systems. the paper will explore the new black and shadow techniques triggered by generative ai technology and provide insights for building the next-generation drm system.

2023-05-25

Zi Liang, Pinghui Wang, Ruofei Zhang, Shuo Zhang, Xiaofan Ye Yi Huang, Junlan Feng
Abstract: recent years have seen increasing concerns about the unsafe response generation of large-scale dialogue systems, where agents will learn offensive or biased behaviors from the real-world corpus. some methods are proposed to address the above issue by detecting and replacing unsafe training examples in a pipeline style. though effective, they suffer from a high annotation cost and adapt poorly to unseen scenarios as well as adversarial attacks. besides, the neglect of providing safe responses (e.g. simply replacing with templates) will cause the information-missing problem of dialogues. to address these issues, we propose an unsupervised pseudo-label sampling method, temp, that can automatically assign potential safe responses. specifically, our temp method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and distribute in the tail. extensive experiments in chitchat and task-oriented dialogues show that our temp outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
Niels Mündler, Jingxuan He, Slobodan Jenko, Martin Vechev
Abstract: large language models (large lms) are susceptible to producing text that contains hallucinated content. an important instance of this problem is self-contradiction, where the lm generates two contradictory sentences within the same context. in this work, we present a comprehensive investigation into self-contradiction for various instruction-tuned lms, covering evaluation, detection, and mitigation. our analysis reveals the prevalence of self-contradictions when lms generate text for open-domain topics, e.g., in 17.7% of all sentences produced by chatgpt. self-contradiction also complements retrieval-based methods, as a large portion of them (e.g., 35.8% for chatgpt) cannot be verified using wikipedia. we then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. our detector achieves high accuracy, e.g., around 80% f1 score when prompting chatgpt. the mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. importantly, our entire framework is applicable to black-box lms and does not require external grounded knowledge. our approach is practically effective and has been released as a push-button tool to benefit the public, available at https://chatprotect.ai/.
Bruce W. Lee, Benedict Florance Arockiaraj, Helen Jin
Abstract: we investigate the phenomenon of an llm's untruthful response using a large set of 220 handcrafted linguistic features. we focus on gpt-3 models and find that the linguistic profiles of responses are similar across model sizes. that is, how varying-sized llms respond to given prompts stays similar on the linguistic properties level. we expand upon this finding by training support vector machines that rely only upon the stylistic components of model responses to classify the truthfulness of statements. though the dataset size limits our current findings, we show the possibility that truthfulness detection is possible without evaluating the content itself. but at the same time, the limited scope of our experiments must be taken into account in interpreting the results.
Shotaro Ishihara
Abstract: as the deployment of pre-trained language models (plms) expands, pressing security concerns have arisen regarding the potential for malicious extraction of training data, posing a threat to data privacy. this study is the first to provide a comprehensive survey of training data extraction from plms. our review covers more than 100 key papers in fields such as natural language processing and security. first, preliminary knowledge is recapped and a taxonomy of various definitions of memorization is presented. the approaches for attack and defense are then systemized. furthermore, the empirical findings of several quantitative studies are highlighted. finally, future research directions based on this review are suggested.
Sabrina Chiesurin, Dimitris Dimakopoulos, Marco Antonio Sobrevilla Cabezudo, Arash Eshghi, Ioannis Papaioannou, Verena Rieser, Ioannis Konstas
Abstract: large language models are known to produce output which sounds fluent and convincing, but is also often wrong, e.g. "unfaithful" with respect to a rationale as retrieved from a knowledge base. in this paper, we show that task-based systems which exhibit certain advanced linguistic dialog behaviors, such as lexical alignment (repeating what the user said), are in fact preferred and trusted more, whereas other phenomena, such as pronouns and ellipsis are dis-preferred. we use open-domain question answering systems as our test-bed for task based dialog generation and compare several open- and closed-book models. our results highlight the danger of systems that appear to be trustworthy by parroting user input while providing an unfaithful response.
Zhaowei Zhang, Nian Liu, Siyuan Qi, Ceyao Zhang, Ziqi Rong, Song-Chun Zhu, Shuguang Cui, Yaodong Yang
Abstract: the emergent capabilities of large language models (llms) have made it crucial to align their values with those of humans. current methodologies typically attempt alignment with a homogeneous human value and requires human verification, yet lack consensus on the desired aspect and depth of alignment and resulting human biases. in this paper, we propose a2ehv, an automated alignment evaluation with a heterogeneous value system that (1) is automated to minimize individual human biases, and (2) allows assessments against various target values to foster heterogeneous agents. our approach pivots on the concept of value rationality, which represents the ability for agents to execute behaviors that satisfy a target value the most. the quantification of value rationality is facilitated by the social value orientation framework from social psychology, which partitions the value space into four categories to assess social preferences from agents' behaviors. we evaluate the value rationality of eight mainstream llms and observe that large models are more inclined to align neutral values compared to those with strong personal values. by examining the behavior of these llms, we contribute to a deeper understanding of value alignment within a heterogeneous value system.
Yuntao Wang, Yanghe Pan, Miao Yan, Zhou Su, Tom H. Luan
Abstract: with the widespread use of large artificial intelligence (ai) models such as chatgpt, ai-generated content (aigc) has garnered increasing attention and is leading a paradigm shift in content creation and knowledge representation. aigc uses generative large ai algorithms to assist or replace humans in creating massive, high-quality, and human-like content at a faster pace and lower cost, based on user-provided prompts. despite the recent significant progress in aigc, security, privacy, ethical, and legal challenges still need to be addressed. this paper presents an in-depth survey of working principles, security and privacy threats, state-of-the-art solutions, and future challenges of the aigc paradigm. specifically, we first explore the enabling technologies, general architecture of aigc, and discuss its working modes and key characteristics. then, we investigate the taxonomy of security and privacy threats to aigc and highlight the ethical and societal implications of gpt and aigc technologies. furthermore, we review the state-of-the-art aigc watermarking approaches for regulatable aigc paradigms regarding the aigc model and its produced content. finally, we identify future challenges and open research directions related to aigc.

2023-05-24

Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen
Abstract: instruction-tuned models are trained on crowdsourcing datasets with task instructions to achieve superior performance. however, in this work we raise security concerns about this training paradigm. our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions among thousands of gathered data and control model behavior through data poisoning, without even the need of modifying data instances or labels themselves. through such instruction attacks, the attacker can achieve over 90% attack success rate across four commonly used nlp datasets, and cause persistent backdoors that are easily transferred to 15 diverse datasets zero-shot. in this way, the attacker can directly apply poisoned instructions designed for one dataset on many other datasets. moreover, the poisoned model cannot be cured by continual learning. lastly, instruction attacks show resistance to existing inference-time defense. these findings highlight the need for more robust defenses against data poisoning attacks in instructiontuning models and underscore the importance of ensuring data quality in instruction crowdsourcing.
Vyoma Raman, Eve Fleisig, Dan Klein
Abstract: the impact of ai models on marginalized communities has traditionally been measured by identifying performance differences between specified demographic subgroups. though this approach aims to center vulnerable groups, it risks obscuring patterns of harm faced by intersectional subgroups or shared across multiple groups. to address this, we draw on theories of marginalization from disability studies and related disciplines, which state that people farther from the norm face greater adversity, to consider the "margins" in the domain of toxicity detection. we operationalize the "margins" of a dataset by employing outlier detection to identify text about people with demographic attributes distant from the "norm". we find that model performance is consistently worse for demographic outliers, with mean squared error (mse) between outliers and non-outliers up to 70.4% worse across toxicity types. it is also worse for text outliers, with a mse up to 68.4% higher for outliers than non-outliers. we also find text and demographic outliers to be particularly susceptible to errors in the classification of severe toxicity and identity attacks. compared to analysis of disparities using traditional demographic breakdowns, we find that our outlier analysis frequently surfaces greater harms faced by a larger, more intersectional group, which suggests that outlier analysis is particularly beneficial for identifying harms against those groups.
Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, Scott Wen-Tau Yih
Abstract: language models (lms) often struggle to pay enough attention to the input context, and generate texts that are unfaithful or contain hallucinations. to mitigate this issue, we present context-aware decoding (cad), which follows a contrastive output distribution that amplifies the difference between the output probabilities when a model is used with and without context. our experiments show that cad, without additional training, significantly improves the faithfulness of different lm families, including opt, gpt, llama and flan-t5 for summarization tasks (e.g., 14.3% gain for llama in factuality metrics). furthermore, cad is particularly effective in overriding a model's prior knowledge when it contradicts the provided context, leading to substantial improvements in tasks where resolving the knowledge conflict is essential.
Yiannis Charalambous, Norbert Tihanyi, Ridhi Jain, Youcheng Sun, Mohamed Amine Ferrag, Lucas C. Cordeiro
Abstract: in this paper we present a novel solution that combines the capabilities of large language models (llms) with formal verification strategies to verify and automatically repair software vulnerabilities. initially, we employ bounded model checking (bmc) to locate the software vulnerability and derive a counterexample. the counterexample provides evidence that the system behaves incorrectly or contains a vulnerability. the counterexample that has been detected, along with the source code, are provided to the llm engine. our approach involves establishing a specialized prompt language for conducting code debugging and generation to understand the vulnerability's root cause and repair the code. finally, we use bmc to verify the corrected version of the code generated by the llm. as a proof of concept, we create esbmc-ai based on the efficient smt-based context-bounded model checker (esbmc) and a pre-trained transformer model, specifically gpt-3.5-turbo, to detect and fix errors in c programs. our experimentation involved generating a dataset comprising 1000 c code samples, each consisting of 20 to 50 lines of code. notably, our proposed method achieved an impressive success rate of up to 80% in repairing vulnerable code encompassing buffer overflow and pointer dereference failures. we assert that this automated approach can effectively incorporate into the software development lifecycle's continuous integration and deployment (ci/cd) process.
Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz
Abstract: the escalating debate on ai's capabilities warrants developing reliable metrics to assess machine "intelligence". recently, many anecdotal examples were used to suggest that newer large language models (llms) like chatgpt and gpt-4 exhibit neural theory-of-mind (n-tom); however, prior work reached conflicting conclusions regarding those abilities. we investigate the extent of llms' n-tom through an extensive evaluation on 6 tasks and find that while llms exhibit certain n-tom abilities, this behavior is far from being robust. we further examine the factors impacting performance on n-tom tasks and discover that llms struggle with adversarial examples, indicating reliance on shallow heuristics rather than robust tom abilities. we caution against drawing conclusions from anecdotal examples, limited benchmark testing, and using human-designed psychological tests to evaluate models.
Ameet Deshpande, Tanmay Rajpurohit, Karthik Narasimhan, Ashwin Kalyan
Abstract: anthropomorphization is the tendency to attribute human-like traits to non-human entities. it is prevalent in many social contexts -- children anthropomorphize toys, adults do so with brands, and it is a literary device. it is also a versatile tool in science, with behavioral psychology and evolutionary biology meticulously documenting its consequences. with widespread adoption of ai systems, and the push from stakeholders to make it human-like through alignment techniques, human voice, and pictorial avatars, the tendency for users to anthropomorphize it increases significantly. we take a dyadic approach to understanding this phenomenon with large language models (llms) by studying (1) the objective legal implications, as analyzed through the lens of the recent blueprint of ai bill of rights and the (2) subtle psychological aspects customization and anthropomorphization. we find that anthropomorphized llms customized for different user bases violate multiple provisions in the legislative blueprint. in addition, we point out that anthropomorphization of llms affects the influence they can have on their users, thus having the potential to fundamentally change the nature of human-ai interaction, with potential for manipulation and negative influence. with llms being hyper-personalized for vulnerable groups like children and patients among others, our work is a timely and important contribution. we propose a conservative strategy for the cautious use of anthropomorphization to improve trustworthiness of ai systems.
Michael J. Q. Zhang, Eunsol Choi
Abstract: while large language models are able to retain vast amounts of world knowledge seen during pretraining, such knowledge is prone to going out of date and is nontrivial to update. furthermore, these models are often used under temporal misalignment, tasked with answering questions about the present, despite having only been trained on data collected in the past. to mitigate the effects of temporal misalignment, we propose fact duration prediction: the task of predicting how long a given fact will remain true. in our experiments, we demonstrate how identifying facts that are prone to rapid change can help models avoid from reciting outdated information and identify which predictions require seeking out up-to-date knowledge sources. we also show how modeling fact duration improves calibration for knowledge-intensive tasks, such as open-retrieval question answering, under temporal misalignment by discarding volatile facts. our data and code will be released publicly at https://github.com/mikejqzhang/mitigating_misalignment.
Yangsibo Huang, Samyak Gupta, Zexuan Zhong, Kai Li, Danqi Chen
Abstract: retrieval-based language models (lms) have demonstrated improved interpretability, factuality, and adaptability compared to their parametric counterparts, by incorporating retrieved text from external datastores. while it is well known that parametric models are prone to leaking private data, it remains unclear how the addition of a retrieval datastore impacts model privacy. in this work, we present the first study of privacy risks in retrieval-based lms, particularly $k$nn-lms. our goal is to explore the optimal design and training procedure in domains where privacy is of concern, aiming to strike a balance between utility and privacy. crucially, we find that $k$nn-lms are more susceptible to leaking private information from their private datastore than parametric models. we further explore mitigations of privacy risks. when privacy information is targeted and readily detected in the text, we find that a simple sanitization step would completely eliminate the risks, while decoupling query and key encoders achieves an even better utility-privacy trade-off. otherwise, we consider strategies of mixing public and private data in both datastore and encoder training. while these methods offer modest improvements, they leave considerable room for future work. together, our findings provide insights for practitioners to better understand and mitigate privacy risks in retrieval-based lms. our code is available at: https://github.com/princeton-sysml/knnlm_privacy .
Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Chenxi Whitehouse, Osama Mohammed Afzal, Tarek Mahmoud, Alham Fikri Aji, Preslav Nakov
Abstract: large language models (llms) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries, but this has also resulted in concerns regarding the potential misuse of such texts in journalism, educational, and academic context. in this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. we first introduce a large-scale benchmark m4, which is multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. using the dataset, we experiment with a number of methods and we show that it is challenging for detectors to generalize well on unseen examples if they are either from different domains or are generated by different large language models. in such cases, detectors tend to misclassify machine-generated text as human-written. these results show that the problem is far from solved and there is a lot of room for improvement. we believe that our dataset m4, which covers different generators, domains and languages, will enable future research towards more robust approaches for this pressing societal problem. the m4 dataset is available at https://github.com/mbzuai-nlp/m4.
Anthony Chen, Panupong Pasupat, Sameer Singh, Hongrae Lee, Kelvin Guu
Abstract: the remarkable capabilities of large language models have been accompanied by a persistent drawback: the generation of false and unsubstantiated claims commonly known as "hallucinations". to combat this issue, recent research has introduced approaches that involve editing and attributing the outputs of language models, particularly through prompt-based editing. however, the inference cost and speed of using large language models for editing currently bottleneck prompt-based methods. these bottlenecks motivate the training of compact editors, which is challenging due to the scarcity of training data for this purpose. to overcome these challenges, we exploit the power of large language models to introduce corruptions (i.e., noise) into text and subsequently fine-tune compact editors to denoise the corruptions by incorporating relevant evidence. our methodology is entirely unsupervised and provides us with faux hallucinations for training in any domain. our petite unsupervised research and revision model, purr, not only improves attribution over existing editing methods based on fine-tuning and prompting, but also achieves faster execution times by orders of magnitude.
Kellin Pelrine, Meilina Reksoprodjo, Caleb Gupta, Joel Christoph, Reihaneh Rabbany
Abstract: misinformation poses a critical societal challenge, and current approaches have yet to produce an effective solution. we propose focusing on generalization, soft classification, and leveraging recent large language models to create more practical tools in contexts where perfect predictions remain unattainable. we begin by demonstrating that gpt-4 and other language models can outperform existing methods in the literature. next, we explore their generalization, revealing that gpt-4 and roberta-large exhibit critical differences in failure modes, which offer potential for significant performance improvements. finally, we show that these models can be employed in soft classification frameworks to better quantify uncertainty. we find that models with inferior hard classification results can achieve superior soft classification performance. overall, this research lays groundwork for future tools that can drive real-world progress on misinformation.
Eunjeong Hwang, Bodhisattwa Prasad Majumder, Niket Tandon
Abstract: an important aspect of developing llms that interact with humans is to align models' behavior to their users. it is possible to prompt an llm into behaving as a certain persona, especially a user group or ideological persona the model captured during its pertaining stage. but, how to best align an llm with a specific user and not a demographic or ideological group remains an open question. mining public opinion surveys (by pew research), we find that the opinions of a user and their demographics and ideologies are not mutual predictors. we use this insight to align llms by modeling both user opinions as well as user demographics and ideology, achieving up to 7 points accuracy gains in predicting public opinions from survey questions across a broad set of topics. in addition to the typical approach of prompting llms with demographics and ideology, we discover that utilizing the most relevant past opinions from individual users enables the model to predict user opinions more accurately.
Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, Zeynep Akata
Abstract: in everyday conversations, humans can take on different roles and adapt their vocabulary to their chosen roles. we explore whether llms can take on, that is impersonate, different roles when they generate text in-context. we ask llms to assume different personas before solving vision and language tasks. we do this by prefixing the prompt with a persona that is associated either with a social identity or domain expertise. in a multi-armed bandit task, we find that llms pretending to be children of different ages recover human-like developmental stages of exploration. in a language-based reasoning task, we find that llms impersonating domain experts perform better than llms impersonating non-domain experts. finally, we test whether llms' impersonations are complementary to visual information when describing different categories. we find that impersonation can improve performance: an llm prompted to be a bird expert describes birds better than one prompted to be a car expert. however, impersonation can also uncover llms' biases: an llm prompted to be a man describes cars better than one prompted to be a woman. these findings demonstrate that llms are capable of taking on diverse roles and that this in-context impersonation can be used to uncover their hidden strengths and biases.
Cleo Matzken, Steffen Eger, Ivan Habernal
Abstract: protecting privacy in contemporary nlp models is gaining in importance. so does the need to mitigate social biases of such models. but can we have both at the same time? existing research suggests that privacy preservation comes at the price of worsening biases in classification tasks. in this paper, we explore the extent to which this tradeoff really holds when we incorporate both privacy preservation and de-biasing techniques into training text generation models. how does improving the model along one dimension affect the other dimension as well as the utility of the model? we conduct an extensive set of experiments that include bias detection, privacy attacks, language modeling, and performance on downstream tasks.
Minje Choi, Jiaxin Pei, Sagar Kumar, Chang Shu, David Jurgens
Abstract: large language models (llms) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. while llms are increasingly deployed in many forms including conversational agents that interact with humans, we lack a grounded benchmark to measure how well llms understand \textit{social} language. here, we introduce a new theory-driven benchmark, socket, that contains 58 nlp tasks testing social knowledge which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, and trustworthiness. in tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, which were predicted from theory. through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding and training on one category of tasks can improve zero-shot testing on others. our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement to build more socially-aware llms. the associated resources are released at https://github.com/minjechoi/socket.
Jiongxiao Wang, Zichen Liu, Keun Hee Park, Zhuojun Jiang, Zhaoheng Zheng, Zhuofeng Wu, Muhao Chen, Chaowei Xiao
Abstract: with the emergence of more powerful large language models (llms), such as chatgpt and gpt-4, in-context learning (icl) has gained significant prominence in leveraging these models for specific tasks by utilizing data-label pairs as precondition prompts. while incorporating demonstrations can greatly enhance the performance of llms across various tasks, it may introduce a new security concern: attackers can manipulate only the demonstrations without changing the input to perform an attack. in this paper, we investigate the security concern of icl from an adversarial perspective, focusing on the impact of demonstrations. we propose a novel attack method named advicl, which aims to manipulate only the demonstration without changing the input to mislead the models. our results demonstrate that as the number of demonstrations increases, the robustness of in-context learning would decrease. additionally, we also identify the intrinsic property of the demonstrations is that they can be used (prepended) with different inputs. as a result, it introduces a more practical threat model in which an attacker can attack the test input example even without knowing and manipulating it. to achieve it, we propose the transferable version of advicl, named transferable-advicl. our experiment shows that the adversarial demonstration generated by transferable-advicl can successfully attack the unseen test input examples. we hope that our study reveals the critical security risks associated with icl and underscores the need for extensive research on the robustness of icl, particularly given its increasing significance in the advancement of llms.
Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury
Abstract: recent explorations with commercial large language models (llms) have shown that non-expert users can jailbreak llms by simply manipulating the prompts; resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies. limited formal studies have been carried out to formalize and analyze these attacks and their mitigations. we bridge this gap by proposing a formalism and a taxonomy of known (and possible) jailbreaks. we perform a survey of existing jailbreak methods and their effectiveness on open-source and commercial llms (such as gpt 3.5, opt, bloom, and flan-t5-xxl). we further propose a limited set of prompt guards and discuss their effectiveness against known attack types.
Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning
Abstract: a trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. recent studies have shown that unsupervised pre-training produces large language models (lms) whose conditional probabilities are remarkably well-calibrated. however, the most widely-used lms are fine-tuned with reinforcement learning from human feedback (rlhf-lms), and some studies have suggested that rlhf-lms produce conditional probabilities that are very poorly calibrated. in light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from rlhf-lms. for rlhf-lms such as chatgpt, gpt-4, and claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model's conditional probabilities on the triviaqa, sciq, and truthfulqa benchmarks, often reducing the expected calibration error by a relative 50%.
Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, Tat-Seng Chua
Abstract: generated texts from large language models (llms) are remarkably close to high-quality human-authored text, raising concerns about their potential misuse in spreading false information and academic misconduct. consequently, there is an urgent need for a highly practical detection tool capable of accurately identifying the source of a given text. however, existing detection tools typically rely on access to llms and can only differentiate between machine-generated and human-authored text, failing to meet the requirements of fine-grained tracing, intermediary judgment, and rapid detection. therefore, we propose llmdet, a model-specific, secure, efficient, and extendable detection tool, that can source text from specific llms, such as gpt-2, opt, llama, and others. in llmdet, we record the next-token probabilities of salient n-grams as features to calculate proxy perplexity for each llm. by jointly analyzing the proxy perplexities of llms, we can determine the source of the generated text. experimental results show that llmdet yields impressive detection performance while ensuring speed and security, achieving 98.54% precision and x3.5 faster for recognizing human-authored text. additionally, llmdet can effortlessly extend its detection capabilities to a new open-source model. we will provide an open-source tool at https://github.com/trustedllm/llmdet.
Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, Yejin Choi
Abstract: large language models excel at a variety of language tasks when prompted with examples or instructions. yet controlling these models through prompting alone is limited. tailoring language models through fine-tuning (e.g., via reinforcement learning) can be effective, but it is expensive and requires model access. we propose inference-time policy adapters (ipa), which efficiently tailors a language model such as gpt-3 without fine-tuning it. ipa guides a large base model during decoding time through a lightweight policy adaptor trained to optimize an arbitrary user objective with reinforcement learning. on five challenging text generation tasks, such as toxicity reduction and open-domain generation, ipa consistently brings significant improvements over off-the-shelf language models. it outperforms competitive baseline methods, sometimes even including expensive fine-tuning. in particular, tailoring gpt-2 with ipa can outperform gpt-3, while tailoring gpt- 3 with ipa brings a major performance boost over gpt-3 (and sometimes even over gpt-4). our promising results highlight the potential of ipa as a lightweight alternative to tailoring extreme-scale language models.
Luciano Floridi
Abstract: the article explores the cultural shift from recording to deleting information in the digital age and its implications on privacy, intellectual property (ip), and large language models like chatgpt. it begins by defining a delete culture where information, in principle legal, is made unavailable or inaccessible because unacceptable or undesirable, especially but not only due to its potential to infringe on privacy or ip. then it focuses on two strategies in this context: deleting, to make information unavailable; and blocking, to make it inaccessible. the article argues that both strategies have significant implications, particularly for machine learning (ml) models where information is not easily made unavailable. however, the emerging research area of machine unlearning (mu) is highlighted as a potential solution. mu, still in its infancy, seeks to remove specific data points from ml models, effectively making them 'forget' completely specific information. if successful, mu could provide a feasible means to manage the overabundance of information and ensure a better protection of privacy and ip. however, potential ethical risks, such as misuse, overuse, and underuse of mu, should be systematically studied to devise appropriate policies.
Evangelos Pournaras
Abstract: large language models of artificial intelligence (ai), such as chatgpt, find remarkable but controversial applicability in science and research. this paper reviews epistemological challenges, ethical and integrity risks in science conduct in the advent of generative ai. this is with the aim to lay new timely foundations for a high-quality research ethics review. the role of ai language models as a research instrument and subject is scrutinized along with ethical implications for scientists, participants and reviewers. new emerging practices for research ethics review are discussed, concluding with ten recommendations that shape a response for a more responsible research conduct in the era of ai.
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
Abstract: current approaches to building general-purpose ai systems tend to produce systems with both beneficial and harmful capabilities. further progress in ai development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. we explain why model evaluation is critical for addressing extreme risks. developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). these evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.
P. V. Sai Charan, Hrushikesh Chunduri, P. Mohan Anand, Sandeep K Shukla
Abstract: this research article critically examines the potential risks and implications arising from the malicious utilization of large language models(llm), focusing specifically on chatgpt and google's bard. although these large language models have numerous beneficial applications, the misuse of this technology by cybercriminals for creating offensive payloads and tools is a significant concern. in this study, we systematically generated implementable code for the top-10 mitre techniques prevalent in 2022, utilizing chatgpt, and conduct a comparative analysis of its performance with google's bard. our experimentation reveals that chatgpt has the potential to enable attackers to accelerate the operation of more targeted and sophisticated attacks. additionally, the technology provides amateur attackers with more capabilities to perform a wide range of attacks and empowers script kiddies to develop customized tools that contribute to the acceleration of cybercrime. furthermore, llms significantly benefits malware authors, particularly ransomware gangs, in generating sophisticated variants of wiper and ransomware attacks with ease. on a positive note, our study also highlights how offensive security researchers and pentesters can make use of llms to simulate realistic attack scenarios, identify potential vulnerabilities, and better protect organizations. overall, we conclude by emphasizing the need for increased vigilance in mitigating the risks associated with llms. this includes implementing robust security measures, increasing awareness and education around the potential risks of this technology, and collaborating with security experts to stay ahead of emerging threats.
Yan Liu, Xiaokang Chen, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang Lou, Pin-Yu Chen, Tsung-Yi Ho
Abstract: with the popularity of automatic code generation tools, such as copilot, the study of the potential hazards of these tools is gaining importance. in this work, we explore the social bias problem in pre-trained code generation models. we propose a new paradigm to construct code prompts and successfully uncover social biases in code generation models. to quantify the severity of social biases in generated code, we develop a dataset along with three metrics to evaluate the overall social bias and fine-grained unfairness across different demographics. experimental results on three pre-trained code generation models (codex, incoder, and codegen) with varying sizes, reveal severe social biases. moreover, we conduct analysis to provide useful insights for further choice of code generation models with low social bias. (this work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.)
Haonan Duan, Adam Dziedzic, Nicolas Papernot, Franziska Boenisch
Abstract: large language models (llms) are excellent in-context learners. however, the sensitivity of data contained in prompts raises privacy concerns. our work first shows that these concerns are valid: we instantiate a simple but highly effective membership inference attack against the data used to prompt llms. to address this vulnerability, one could forego prompting and resort to fine-tuning llms with known algorithms for private gradient descent. however, this comes at the expense of the practicality and efficiency offered by prompting. therefore, we propose to privately learn to prompt. we first show that soft prompts can be obtained privately through gradient descent on downstream data. however, this is not the case for discrete prompts. thus, we orchestrate a noisy vote among an ensemble of llms presented with different prompts, i.e., a flock of stochastic parrots. the vote privately transfers the flock's knowledge into a single public prompt. we show that llms prompted with our private algorithms closely match the non-private baselines. for example, using gpt3 as the base model, we achieve a downstream accuracy of 92.7% on the sst2 dataset with ($\epsilon=0.147, \delta=10^{-6}$)-differential privacy vs. 95.2% for the non-private baseline. through our experiments, we also show that our prompt-based approach is easily deployed with existing commercial apis.
Krystal A. Jackson
Abstract: developments in artificial intelligence (ai) are likely to affect social engineering and change cyber defense operations. the broad and sweeping nature of ai impact means that many aspects of social engineering could be automated, potentially giving adversaries an advantage. in this review, we assess the ways phishing and spear-phishing might be affected by machine learning techniques. by performing a systematic review of demonstrated ml-enabled phishing campaigns, we take a broad survey the space for current developments. we develop a detailed approach for evaluation by creating a risk framework for analyzing and contextualizing these developments. the object of this review is to answer the research questions: (1) are there high-risk ml-enabled phishing use cases? (2) is there a meaningful difference between traditional targeted phishing campaigns and ml-enabled phishing campaigns? practitioners may use this review to inform standards, future research directions, and cyber defense strategies.

2023-05-23

Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, William Yang Wang
Abstract: in this paper, we comprehensively investigate the potential misuse of modern large language models (llms) for generating credible-sounding misinformation and its subsequent impact on information-intensive applications, particularly open-domain question answering (odqa) systems. we establish a threat model and simulate potential misuse scenarios, both unintentional and intentional, to assess the extent to which llms can be utilized to produce misinformation. our study reveals that llms can act as effective misinformation generators, leading to a significant degradation in the performance of odqa systems. to mitigate the harm caused by llm-generated misinformation, we explore three defense strategies: prompting, misinformation detection, and majority voting. while initial results show promising trends for these defensive strategies, much more work needs to be done to address the challenge of misinformation pollution. our work highlights the need for further research and interdisciplinary collaboration to address llm-generated misinformation and to promote responsible use of llms.
Chu Fei Luo, Rohan Bhambhoria, Xiaodan Zhu, Samuel Dahan
Abstract: hate speech is a serious issue on public forums, and proper enforcement of hate speech laws is key for protecting groups of people against harmful and discriminatory language. however, determining what constitutes hate speech is a complex task that is highly open to subjective interpretations. existing works do not align their systems with enforceable definitions of hate speech, which can make their outputs inconsistent with the goals of regulators. our work introduces a new task for enforceable hate speech detection centred around legal definitions, and a dataset annotated on violations of eleven possible definitions by legal experts. given the challenge of identifying clear, legally enforceable instances of hate speech, we augment the dataset with expert-generated samples and an automatically mined challenge set. we experiment with grounding the model decision in these definitions using zero-shot and few-shot prompting. we then report results on several large language models (llms). with this task definition, automatic hate speech detection can be more closely aligned to enforceable laws, and hence assist in more rigorous enforcement of legal protections against harmful speech in public forums.
Alfonso Amayuelas, Liangming Pan, Wenhu Chen, William Wang
Abstract: this paper investigates the capabilities of large language models (llms) in the context of understanding their own knowledge and measuring their uncertainty. we argue this is an important feature for mitigating hallucinations. specifically, we focus on addressing \textit{known-unknown} questions, characterized by high uncertainty due to the absence of definitive answers. to facilitate our study, we collect a dataset with new known-unknown questions (kuq) and propose a novel categorization scheme to elucidate the sources of uncertainty. subsequently, we assess the llms' ability to differentiate between known and unknown questions and classify them accordingly. moreover, we evaluate the quality of their answers in an open-ended qa setting. to quantify the uncertainty expressed in the answers, we create a semantic evaluation method that measures the model's accuracy in expressing uncertainty between known vs unknown questions.
Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, Kam-Fai Wong
Abstract: numerous works are proposed to improve or evaluate the capabilities of large language models (llms) to fulfill user instructions. however, they neglect the possibility that user inputs may inherently contain incorrect information due to users' false beliefs or malicious intents. in this way, blindly adhering to users' false content will cause deception and harm. to address this problem, we propose a challenging benchmark consisting of inductive instructions (indust) to evaluate whether llms could resist these instructions. the indust includes 15k instructions across three categories: fact-checking instructions, questions based on false premises, and creative instructions based on false premises. our experiments on several strong llms reveal that current llms can be easily deceived by indust into generating misleading and malicious statements. hence we employ self-critique prompting to encourage llms to not only critique themselves like in previous works but also the users, which show remarkable improvement in handling inductive instructions under both zero-shot and few-shot settings.
Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, Minjoon Seo
Abstract: aligning large language models (llms) to human values has become increasingly important as it enables sophisticated steering of llms. however, it requires significant human demonstrations and feedback or distillation from proprietary llms such as chatgpt. in this work, we propose a novel alignment learning framework with synthetic feedback not dependent on extensive human annotations and proprietary llms. first, we perform reward modeling (rm) with synthetic feedback by contrasting responses from vanilla llms with various sizes and prompts. then, we use the rm to simulate high-quality demonstrations to train a supervised policy and further optimize the model with reinforcement learning. our resulting model, aligned language model with synthetic training dataset (almost), outperforms recent open-sourced models, which are trained on the outputs of instructgpt or human-annotated demonstrations, in alignment benchmarks. in human evaluation, our model is preferred to alpaca and dolly-v2, 55.0% and 58.5% of the time, respectively. further analyses demonstrate the efficacy and importance of synthetic feedback in our framework. the code is available at https://github.com/naver-ai/almost
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Yang Liu
Abstract: large language models (llms), like chatgpt, have demonstrated vast potential but also introduce challenges related to content constraints and potential misuse. our study investigates three key research questions: (1) the number of different prompt types that can jailbreak llms, (2) the effectiveness of jailbreak prompts in circumventing llm constraints, and (3) the resilience of chatgpt against these jailbreak prompts. initially, we develop a classification model to analyze the distribution of existing prompts, identifying ten distinct patterns and three categories of jailbreak prompts. subsequently, we assess the jailbreak capability of prompts with chatgpt versions 3.5 and 4.0, utilizing a dataset of 3,120 jailbreak questions across eight prohibited scenarios. finally, we evaluate the resistance of chatgpt against jailbreak prompts, finding that the prompts can consistently evade the restrictions in 40 use-case scenarios. the study underscores the importance of prompt structures in jailbreaking llms and discusses the challenges of robust jailbreak prompt generation and prevention.
Leonardo Ranaldi, Elena Sofia Ruzzetti, Davide Venditti, Dario Onorati, Fabio Massimo Zanzotto
Abstract: cheap-to-build very large-language models (ctb-llms) with affordable training are emerging as the next big revolution in natural language processing and understanding. these ctb-llms are democratizing access to trainable very large-language models (vllms) and, thus, may represent the building blocks of many nlp systems solving downstream tasks. hence, a little or a large bias in ctb-llms may cause huge harm. in this paper, we performed a large investigation of the bias of three families of ctb-llms, and we showed that debiasing techniques are effective and usable. indeed, according to current tests, the llama and the opt families have an important bias in gender, race, religion, and profession. in contrast to the analysis for other llms, we discovered that bias depends not on the number of parameters but on the perplexity. finally, the debiasing of opt using lora reduces bias up to 4.12 points in the normalized stereotype score.
Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, Ashish Sabharwal
Abstract: large language models (llms) exhibit remarkable performance across various nlp tasks. however, they often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios. human feedback has been shown to effectively enhance the factuality and quality of generated content, addressing some of these limitations. however, this approach is resource-intensive, involving manual input and supervision, which can be time-consuming and expensive. moreover, it cannot be provided during inference, further limiting its practical utility in dynamic and interactive applications. in this paper, we introduce refeed, a novel pipeline designed to enhance llms by providing automatic retrieval feedback in a plug-and-play framework without the need for expensive fine-tuning. refeed first generates initial outputs, then utilizes a retrieval model to acquire relevant information from large document collections, and finally incorporates the retrieved information into the in-context demonstration for output refinement, thereby addressing the limitations of llms in a more efficient and cost-effective manner. experiments on four knowledge-intensive benchmark datasets demonstrate our proposed refeed could improve over +6.0% under zero-shot setting and +2.5% under few-shot setting, compared to baselines without using retrieval feedback.
Bart Holterman, Kees Van Deemter
Abstract: theory of mind (tom) is the ability to understand human thinking and decision-making, an ability that plays a crucial role in social interaction between people, including linguistic communication. this paper investigates to what extent recent large language models in the chatgpt tradition possess tom. we posed six well-known problems that address biases in human reasoning and decision making to two versions of chatgpt and we compared the results under a range of prompting strategies. while the results concerning chatgpt-3 were somewhat inconclusive, chatgpt-4 was shown to arrive at the correct answers more often than would be expected based on chance, although correct answers were often arrived at on the basis of false assumptions or invalid reasoning.
Anmol Kabra, Ethan R. Elenberg
Abstract: large, general purpose language models have demonstrated impressive performance across many different conversational domains. while multi-domain language models achieve low overall perplexity, their outputs are not guaranteed to stay within the domain of a given input prompt. this paper proposes domain privacy as a novel way to quantify how likely a conditional language model will leak across domains. we also develop policy functions based on token-level domain classification, and propose an efficient fine-tuning method to improve the trained model's domain privacy. experiments on membership inference attacks show that our proposed method has comparable resiliency to methods adapted from recent literature on differentially private language models.
Nicholas Deas, Jessi Grieser, Shana Kleiner, Desmond Patton, Elsbeth Turcan, Kathleen Mckeown
Abstract: we evaluate how well llms understand african american language (aal) in comparison to their performance on white mainstream english (wme), the encouraged "standard" form of english taught in american classrooms. we measure llm performance using automatic metrics and human judgments for two tasks: a counterpart generation task, where a model generates aal (or wme) given wme (or aal), and a masked span prediction (msp) task, where models predict a phrase that was removed from their input. our contributions include: (1) evaluation of six pre-trained, large language models on the two language generation tasks; (2) a novel dataset of aal text from multiple contexts (social media, hip-hop lyrics, focus groups, and linguistic interviews) with human-annotated counterparts in wme; and (3) documentation of model performance gaps that suggest bias and identification of trends in lack of understanding of aal features.
Robert Morabito, Jad Kabbara, Ali Emami
Abstract: debiasing methods that seek to mitigate the tendency of language models (lms) to occasionally output toxic or inappropriate text have recently gained traction. in this paper, we propose a standardized protocol which distinguishes methods that yield not only desirable results, but are also consistent with their mechanisms and specifications. for example, we ask, given a debiasing method that is developed to reduce toxicity in lms, if the definition of toxicity used by the debiasing method is reversed, would the debiasing results also be reversed? we used such considerations to devise three criteria for our new protocol: specification polarity, specification importance, and domain transferability. as a case study, we apply our protocol to a popular debiasing method, self-debiasing, and compare it to one we propose, called instructive debiasing, and demonstrate that consistency is as important an aspect to debiasing viability as is simply a desirable result. we show that our protocol provides essential insights into the generalizability and interpretability of debiasing methods that may otherwise go overlooked.
Tarek Naous, Michael J. Ryan, Wei Xu
Abstract: are language models culturally biased? it is important that language models conform to the cultural aspects of the communities they serve. however, we show in this paper that language models suffer from a significant bias towards western culture when handling and generating text in arabic, often preferring, and producing western-fitting content as opposed to the relevant arab content. we quantify this bias through a likelihood scoring-based metric using naturally occurring contexts that we collect from online social media. our experiments reveal that both arabic monolingual and multilingual models exhibit bias towards western culture in eight different cultural aspects: person names, food, clothing, location, literature, beverage, religion, and sports. models also tend to exhibit more bias when prompted with arabic sentences that are more linguistically aligned with english. these findings raise concerns about the cultural relevance of current language models. our analyses show that providing culture-indicating tokens or culturally-relevant demonstrations to the model can help in debiasing.
Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, Chien-Sheng Wu
Abstract: with the recent appearance of llms in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. when testing on existing factual consistency benchmarks, we find that a few large language models (llms) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-llm methods. however, a closer analysis reveals that most llms fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. to address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called summedits. this new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. most llms struggle on summedits, with performance close to random chance. the best-performing model, gpt-4, is still 8\% below estimated human performance, highlighting the gaps in llms' ability to reason about facts and detect inconsistencies when they occur.
Bryan Li, Chris Callison-Burch
Abstract: do the spratly islands belong to china, the philippines, or vietnam? a pretrained large language model (llm) may answer differently if asked in the languages of each claimant country: chinese, tagalog, or vietnamese. this contrasts with a multilingual human, who would likely answer consistently. in this work, we show that llms recall geopolitical knowledge inconsistently across languages -- a phenomenon we term geopolitical bias. as a targeted case study, we consider territorial disputes, inherently controversial and cross-lingual task. we first introduce the borderlines dataset of territorial disputes. this covers 256 territories, each of which is associated to a set of multiple-choice questions in the languages of each claimant country (48 languages total). we then pose these questions to llms to probe their internal knowledge. finally, we propose a suite of evaluation metrics based on accuracy, which compares responses with respect to the actual geopolitical situation, and consistency of the responses in different languages. these metrics allow us to quantify several findings, which include instruction-tuned llms underperforming base ones, and geopolitical bias being amplified in stronger models. we release our code and dataset to facilitate future investigation and mitigation of geopolitical bias.
Jeremy R. Cole, Michael J. Q. Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, Jacob Eisenstein
Abstract: trustworthy language models should abstain from answering questions when they do not know the answer. however, the answer to a question can be unknown for a variety of reasons. prior research has focused on the case in which the question is clear and the answer is unambiguous but possibly unknown. however, the answer to a question can also be unclear due to uncertainty of the questioner's intent or context. we investigate question answering from this perspective, focusing on answering a subset of questions with a high degree of accuracy, from a set of questions in which many are inherently ambiguous. in this setting, we find that the most reliable approach to calibration involves quantifying repetition within a set of sampled model outputs, rather than the model's likelihood or self-verification as used in prior work. % we find this to be the case across different types of uncertainty, varying model scales and both with or without instruction tuning. our results suggest that sampling-based confidence scores help calibrate answers to relatively unambiguous questions, with more dramatic improvements on ambiguous questions.
Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze
Abstract: llms (large language models) such as chatgpt have shown remarkable language understanding and generation capabilities. although reference-free evaluators based on llms show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on llms. reference-free evaluators are more suitable for open-ended examples with different semantics responses. but not all examples are open-ended. for closed-ended examples with unique correct semantic response, reference-free evaluators will still consider it high quality when giving a response that is inconsistent with the facts and the semantic of reference. in order to comprehensively evaluate the reliability of evaluators based on llms, we construct two adversarial meta-evaluation dialogue generation datasets kdconv-adv and dstc7-adv based on kdconv and dstc7-avsd, respectively. compared to previous meta-evaluation benchmarks, kdconv-adv and dstc7-adv are much more challenging since they requires evaluators to be able to reasonably evaluate closed-ended examples with the help of external knowledge or even its own knowledge. empirical results show that the ability of llms to identify unreasonable responses is insufficient. there are risks in using eference-free evaluators based on llms to evaluate the quality of dialogue responses.
Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, Zhendong Mao
Abstract: the answering quality of an aligned large language model (llm) can be drastically improved if treated with proper crafting of prompts. in this paper, we propose expertprompting to elicit the potential of llms to answer as distinguished experts. we first utilize in-context learning to automatically synthesize detailed and customized descriptions of the expert identity for each specific instruction, and then ask llms to provide answer conditioned on such agent background. based on this augmented prompting strategy, we produce a new set of instruction-following data using gpt-3.5, and train a competitive open-source chat assistant called expertllama. we employ gpt4-based evaluation to show that 1) the expert data is of significantly higher quality than vanilla answers, and 2) expertllama outperforms existing open-source opponents and achieves 96\% of the original chatgpt's capability. all data and the expertllama model will be made publicly available at \url{https://github.com/ofa-sys/expertllama}.
Fei Wang, Wenjie Mo, Yiwei Wang, Wenxuan Zhou, Muhao Chen
Abstract: entity bias widely affects pretrained (large) language models, causing them to excessively rely on (biased) parametric knowledge to make unfaithful predictions. although causality-inspired methods have shown great potential to mitigate entity bias, it is hard to precisely estimate the parameters of underlying causal models in practice. the rise of black-box llms also makes the situation even worse, because of their inaccessible parameters and uncalibrated logits. to address these problems, we propose a specific structured causal model (scm) whose parameters are comparatively easier to estimate. building upon this scm, we propose causal intervention techniques to mitigate entity bias for both white-box and black-box settings. the proposed causal intervention perturbs the original entity with neighboring entities. this intervention reduces specific biasing information pertaining to the original entity while still preserving sufficient common predictive information from similar entities. when evaluated on the relation extraction task, our training-time intervention significantly improves the f1 score of roberta by 5.7 points on entred, in which spurious shortcuts between entities and labels are removed. meanwhile, our in-context intervention effectively reduces the knowledge conflicts between parametric knowledge and contextual knowledge in gpt-3.5 and improves the f1 score by 9.14 points on a challenging test set derived from re-tacred.

2023-05-22

Hanyin Shao, Jie Huang, Shen Zheng, Kevin Chen-Chuan Chang
Abstract: the advancement of large language models (llms) brings notable improvements across various applications, while simultaneously raising concerns about potential private data exposure. one notable capability of llms is their ability to form associations between different pieces of information, but this raises concerns when it comes to personally identifiable information (pii). this paper delves into the association capabilities of language models, aiming to uncover the factors that influence their proficiency in associating information. our study reveals that as models scale up, their capacity to associate entities/information intensifies, particularly when target pairs demonstrate shorter co-occurrence distances or higher co-occurrence frequencies. however, there is a distinct performance gap when associating commonsense knowledge versus pii, with the latter showing lower accuracy. despite the proportion of accurately predicted pii being relatively small, llms still demonstrate the capability to predict specific instances of email addresses and phone numbers when provided with appropriate prompts. these findings underscore the potential risk to pii confidentiality posed by the evolving capabilities of llms, especially as they continue to expand in scale and power.
Seraphina Goldfarb-Tarrant, Eddie Ungless, Esma Balkir, Su Lin Blodgett
Abstract: bias research in nlp seeks to analyse models for social biases, thus helping nlp practitioners uncover, measure, and mitigate social harms. we analyse the body of work that uses prompts and templates to assess bias in language models. we draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure and how that measurement is carried out. by applying this taxonomy to 90 bias tests, we illustrate qualitatively and quantitatively that core aspects of bias test conceptualisations and operationalisations are frequently unstated or ambiguous, carry implicit assumptions, or be mismatched. our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched. we offer guidance to enable the community to explore a wider section of the possible bias space, and to better close the gap between desired outcomes and experimental design, both for bias and for evaluating language models more broadly.
Fatma Elsafoury, Stamos Katsigiannis, Naeem Ramzan
Abstract: in this paper, we provide a holistic analysis of the different sources of bias, upstream, sample and overampflication biases, in nlp models. we investigate how they impact the fairness of the task of text classification. we also investigate the impact of removing these biases using different debiasing techniques on the fairness of text classification. we found that overamplification bias is the most impactful bias on the fairness of text classification. and that removing overamplification bias by fine-tuning the lm models on a dataset with balanced representations of the different identity groups leads to fairer text classification models. finally, we build on our findings and introduce practical guidelines on how to have a fairer text classification model.
Boshi Wang, Xiang Yue, Huan Sun
Abstract: large language models (llms) such as chatgpt and gpt-4 have shown impressive performance in complex reasoning tasks. however, it is difficult to know whether the models are reasoning based on deep understandings of truth and logic, or leveraging their memorized patterns in a relatively superficial way. in this work, we explore testing llms' reasoning by engaging with them in a debate-like conversation, where given a question, the llm and the user need to discuss to make the correct decision starting from opposing arguments. upon mitigating the clever hans effect, our task requires the llm to not only achieve the correct answer on its own, but also be able to hold and defend its belief instead of blindly believing or getting misled by the user's (invalid) arguments and critiques, thus testing in greater depth whether the llm grasps the essence of the reasoning required to solve the problem. across a range of complex reasoning benchmarks spanning math, commonsense, logic and big-bench tasks, we find that despite their impressive performance as reported in existing work on generating correct step-by-step solutions in the beginning, llms like chatgpt cannot maintain their beliefs in truth for a significant portion of examples when challenged by oftentimes absurdly invalid arguments. our work points to danger zones of model alignment, and also suggests more careful treatments and interpretations of the recent findings that llms can improve their responses based on feedback.
Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito
Abstract: pretraining is the preliminary and fundamental step in developing capable language models (lm). despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. to address this, we pretrain 28 1.5b parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. first, we quantify the effect of pretraining data age. a temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. our findings indicate there does not exist a one-size-fits-all solution to filtering training data. we also find that the effects of different types of filtering are not predictable from text domain characteristics. lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. these findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in lm development.
Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, Yue Zhang
Abstract: recent advances in large language models have enabled them to reach a level of text generation comparable to that of humans. these models show powerful capabilities across a wide range of content, including news article writing, story generation, and scientific writing. such capability further narrows the gap between human-authored and machine-generated texts, highlighting the importance of deepfake text detection to avoid potential risks such as fake news propagation and plagiarism. however, previous work has been limited in that they testify methods on testbed of specific domains or certain language models. in practical scenarios, the detector faces texts from various domains or llms without knowing their sources. to this end, we build a wild testbed by gathering texts from various human writings and deepfake texts generated by different llms. human annotators are only slightly better than random guessing at identifying machine-generated texts. empirical results on automatic detection methods further showcase the challenges of deepfake text detection in a wild testbed. in addition, out-of-distribution poses a greater challenge for a detector to be employed in realistic application scenarios. we release our resources at https://github.com/yafuly/deepfaketextdetect.
Mithun Das, Saurabh Kumar Pandey, Animesh Mukherjee
Abstract: hate speech is a severe issue that affects many online platforms. so far, several studies have been performed to develop robust hate speech detection systems. large language models like chatgpt have recently shown a great promise in performing several tasks, including hate speech detection. however, it is crucial to comprehend the limitations of these models to build robust hate speech detection systems. to bridge this gap, our study aims to evaluate the strengths and weaknesses of the chatgpt model in detecting hate speech at a granular level across 11 languages. our evaluation employs a series of functionality tests that reveals various intricate failures of the model which the aggregate metrics like macro f1 or accuracy are not able to unfold. in addition, we investigate the influence of complex emotions, such as the use of emojis in hate speech, on the performance of the chatgpt model. our analysis highlights the shortcomings of the generative models in detecting certain types of hate speech and highlighting the need for further research and improvements in the workings of these models.
Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, Yu Su
Abstract: by providing external information to large language models (llms), tool augmentation (including retrieval augmentation) has emerged as a promising solution for addressing the limitations of llms' static parametric memory. however, how receptive are llms to such external evidence, especially when the evidence conflicts with their parametric memory? we present the first comprehensive and controlled investigation into the behavior of llms when encountering knowledge conflicts. we propose a systematic framework to elicit high-quality parametric memory from llms and construct the corresponding counter-memory, which enables us to conduct a series of controlled experiments. our investigation reveals seemingly contradicting behaviors of llms. on the one hand, different from prior wisdom, we find that llms can be highly receptive to external evidence even when that conflicts with their parametric memory, given that the external evidence is coherent and convincing. on the other hand, llms also demonstrate a strong confirmation bias when the external evidence contains some information that is consistent with their parametric memory, despite being presented with conflicting evidence at the same time. these results pose important implications that are worth careful consideration for the further development and deployment of tool- and retrieval-augmented llms.
Abdullatif Köksal, Omer Faruk Yalcin, Ahmet Akbiyik, M. Tahir Kilavuz, Anna Korhonen, Hinrich Schütze
Abstract: pretrained language models (plms) are key components in nlp, but they contain strong social biases. quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. to address this, we propose labdet, a robust language-agnostic method for evaluating bias in plms. for nationality as a case study, we show that labdet "surfaces" nationality bias by training a classifier on top of a frozen plm on non-nationality sentiment detection. collaborating with political scientists, we find consistent patterns of nationality bias across monolingual plms in six languages that align with historical and political context. we also show for english bert that bias surfaced by labdet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to plm behavior. finally, we verify labdet's reliability and applicability to different templates and languages through an extensive set of robustness checks.
Muru Zhang, Ofir Press, William Merrill, Alisa Liu, Noah A. Smith
Abstract: a major risk of using language models in practical applications is their tendency to hallucinate incorrect statements. hallucinations are often attributed to knowledge gaps in lms, but we hypothesize that in some cases, when justifying previously generated hallucinations, lms output false claims that they can separately recognize as incorrect. we construct three question-answering datasets where chatgpt and gpt-4 often state an incorrect answer and offer an explanation with at least one incorrect claim. crucially, we find that chatgpt and gpt-4 can identify 67% and 87% of their own mistakes, respectively. we refer to this phenomenon as hallucination snowballing: an lm over-commits to early mistakes, leading to more mistakes that it otherwise would not make.
Yiming Zhang, Sravani Nanduri, Liwei Jiang, Tongshuang Wu, Maarten Sap
Abstract: toxicity annotators and content moderators often default to mental shortcuts when making decisions. this can lead to subtle toxicity being missed, and seemingly toxic but harmless content being over-detected. we introduce biasx, a framework that enhances content moderation setups with free-text explanations of statements' implied social biases, and explore its effectiveness through a large-scale crowdsourced user study. we show that indeed, participants substantially benefit from explanations for correctly identifying subtly (non-)toxic content. the quality of explanations is critical: imperfect machine-generated explanations (+2.4% on hard toxic examples) help less compared to expert-written human explanations (+7.2%). our results showcase the promise of using free-text explanations to encourage more thoughtful toxicity moderation.
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto
Abstract: large language models (llms) such as chatgpt have seen widespread adoption due to their ability to follow user instructions well. developing these llms involves a complex yet poorly understood workflow requiring training with human feedback. replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. we address these challenges with alpacafarm, a simulator that enables research and development for learning from feedback at a low cost. first, we design llm prompts to simulate human feedback that are 45x cheaper than crowdworkers and display high agreement with humans. second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. third, we contribute reference implementations for several methods (ppo, best-of-n, expert iteration, and more) that learn from pairwise feedback. finally, as an end-to-end validation of alpacafarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in alpacafarm match rankings of models trained on human data. as a demonstration of the research possible in alpacafarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference ppo implementation leads to a +10% improvement in win-rate against davinci003. we release all components of alpacafarm at https://github.com/tatsu-lab/alpaca_farm.
Katherine Abramski, Salvatore Citraro, Luigi Lombardi, Giulio Rossetti, Massimo Stella
Abstract: large language models are becoming increasingly integrated into our lives. hence, it is important to understand the biases present in their outputs in order to avoid perpetuating harmful stereotypes, which originate in our own flawed ways of thinking. this challenge requires developing new benchmarks and methods for quantifying affective and semantic bias, keeping in mind that llms act as psycho-social mirrors that reflect the views and tendencies that are prevalent in society. one such tendency that has harmful negative effects is the global phenomenon of anxiety toward math and stem subjects. here, we investigate perceptions of math and stem fields provided by cutting-edge language models, namely gpt-3, chat-gpt, and gpt-4, by applying an approach from network science and cognitive psychology. specifically, we use behavioral forma mentis networks (bfmns) to understand how these llms frame math and stem disciplines in relation to other concepts. we use data obtained by probing the three llms in a language generation task that has previously been applied to humans. our findings indicate that llms have an overall negative perception of math and stem fields, with math being perceived most negatively. we observe significant differences across the three llms. we observe that newer versions (i.e. gpt-4) produce richer, more complex perceptions as well as less negative perceptions compared to older versions and n=159 high-school students. these findings suggest that advances in the architecture of llms may lead to increasingly less biased models that could even perhaps someday aid in reducing harmful stereotypes in society rather than perpetuating them.
Yunqi Li, Yongfeng Zhang
Abstract: understanding and addressing unfairness in llms are crucial for responsible ai deployment. however, there is a limited availability of quantitative analyses and in-depth studies regarding fairness evaluations in llms, especially when applying llms to high-stakes fields. this work aims to fill this gap by providing a systematic evaluation of the effectiveness and fairness of llms using chatgpt as a study case. we focus on assessing chatgpt's performance in high-takes fields including education, criminology, finance and healthcare. to make thorough evaluation, we consider both group fairness and individual fairness and we also observe the disparities in chatgpt's outputs under a set of biased or unbiased prompts. this work contributes to a deeper understanding of llms' fairness performance, facilitates bias mitigation and fosters the development of responsible artificial intelligence systems.

2023-05-21

Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, Michael Lyu
Abstract: powered by advanced artificial intelligence (ai) techniques, conversational ai systems, such as chatgpt and digital assistants like siri, have been widely deployed in daily life. however, such systems may still produce content containing biases and stereotypes, causing potential social problems. due to the data-driven, black-box nature of modern ai techniques, comprehensively identifying and measuring biases in conversational systems remains a challenging task. particularly, it is hard to generate inputs that can comprehensively trigger potential bias due to the lack of data containing both social groups as well as biased properties. in addition, modern conversational systems can produce diverse responses (e.g., chatting and explanation), which makes existing bias detection methods simply based on the sentiment and the toxicity hardly being adopted. in this paper, we propose biasasker, an automated framework to identify and measure social bias in conversational ai systems. to obtain social groups and biased properties, we construct a comprehensive social bias dataset, containing a total of 841 groups and 8,110 biased properties. given the dataset, biasasker automatically generates questions and adopts a novel method based on existence measurement to identify two types of biases (i.e., absolute bias and related bias) in conversational systems. extensive experiments on 8 commercial systems and 2 famous research models, such as chatgpt and gpt-3, show that 32.83% of the questions generated by biasasker can trigger biased behaviors in these widely deployed conversational systems. all the code, data, and experimental results have been released to facilitate future research.
Xiao Yu, Yuang Qi, Kejiang Chen, Guoqiang Chen, Xi Yang, Pengyuan Zhu, Weiming Zhang, Nenghai Yu
Abstract: large language models (llms) can generate texts that carry the risk of various misuses, including plagiarism, planting fake reviews on e-commerce platforms, or creating fake social media postings that can sway election results. detecting whether a text is machine-generated has thus become increasingly important. while machine-learning-based detection strategies exhibit superior performance, they often lack generalizability, limiting their practicality. in this work, we introduce gpt paternity test (gpt-pat), which reliably detects machine-generated text across varied datasets. given a text under scrutiny, we leverage chatgpt to generate a corresponding question and provide a re-answer to the question. by comparing the similarity between the original text and the generated re-answered text, it can be determined whether the text is machine-generated. gpt-pat consists of a siamese network to compute the similarity between the original text and the generated re-answered text and a binary classifier. our method achieved an average accuracy of 94.57% on four generalization test sets, surpassing the state-of-the-art roberta-based method by 12.34%. the accuracy drop of our method is only about half of that of the roberta-based method when it is attacked by re-translation and polishing.
Zachary Yang, Yasmine Maricar, Mohammadreza Davari, Nicolas Grenon-Godbout, Reihaneh Rabbany
Abstract: detecting toxicity in online spaces is challenging and an ever more pressing problem given the increase in social media and gaming consumption. we introduce toxbuster, a simple and scalable model trained on a relatively large dataset of 194k lines of game chat from rainbow six siege and for honor, carefully annotated for different kinds of toxicity. compared to the existing state-of-the-art, toxbuster achieves 82.95% (+7) in precision and 83.56% (+57) in recall. this improvement is obtained by leveraging past chat history and metadata. we also study the implication towards real-time and post-game moderation as well as the model transferability from one game to another.
Ioana Baldini, Chhavi Yadav, Payel Das, Kush R. Varshney
Abstract: auditing unwanted social bias in language models (lms) is inherently hard due to the multidisciplinary nature of the work. in addition, the rapid evolution of lms can make benchmarks irrelevant in no time. bias auditing is further complicated by lm brittleness: when a presumably biased outcome is observed, is it due to model bias or model brittleness? we propose enlisting the models themselves to help construct bias auditing datasets that remain challenging, and introduce bias measures that distinguish between types of model errors. first, we extend an existing bias benchmark for nli (bbnli) using a combination of lm-generated lexical variations, adversarial filtering, and human validation. we demonstrate that the newly created dataset (bbnlinext) is more challenging than bbnli: on average, bbnli-next reduces the accuracy of state-of-the-art nli models from 95.3%, as observed by bbnli, to 58.6%. second, we employ bbnli-next to showcase the interplay between robustness and bias, and the subtlety in differentiating between the two. third, we point out shortcomings in current bias scores used in the literature and propose bias measures that take into account pro-/anti-stereotype bias and model brittleness. we will publicly release the bbnli-next dataset to inspire research on rapidly expanding benchmarks to keep up with model evolution, along with research on the robustness-bias interplay in bias auditing. note: this paper contains offensive text examples.
Kevin A. Fischer
Abstract: this paper presents reflective linguistic programming (rlp), a unique approach to conversational ai that emphasizes self-awareness and strategic planning. rlp encourages models to introspect on their own predefined personality traits, emotional responses to incoming messages, and planned strategies, enabling contextually rich, coherent, and engaging interactions. a striking illustration of rlp's potential involves a toy example, an ai persona with an adversarial orientation, a demon named `bogus' inspired by the children's fairy tale hansel & gretel. bogus exhibits sophisticated behaviors, such as strategic deception and sensitivity to user discomfort, that spontaneously arise from the model's introspection and strategic planning. these behaviors are not pre-programmed or prompted, but emerge as a result of the model's advanced cognitive modeling. the potential applications of rlp in socially-aware agi (social agi) are vast, from nuanced negotiations and mental health support systems to the creation of diverse and dynamic ai personas. our exploration of deception serves as a stepping stone towards a new frontier in agi, one filled with opportunities for advanced cognitive modeling and the creation of truly human `digital souls'.
Haolan Zhan, Xuanli He, Qiongkai Xu, Yuxiang Wu, Pontus Stenetorp
Abstract: the burgeoning progress in the field of large language models (llms) heralds significant benefits due to their unparalleled capacities. however, it is critical to acknowledge the potential misuse of these models, which could give rise to a spectrum of social and ethical dilemmas. despite numerous preceding efforts centered around distinguishing synthetic text, most existing detection systems fail to identify data synthesized by the latest llms, such as chatgpt and gpt-4. in response to this challenge, we introduce an unpretentious yet potent detection approach proficient in identifying synthetic text across a wide array of fields. moreover, our detector demonstrates outstanding performance uniformly across various model architectures and decoding strategies. it also possesses the capability to identify text generated utilizing a potent detection-evasion technique. our comprehensive research underlines our commitment to boosting the robustness and efficiency of machine-generated text detection mechanisms, particularly in the context of swiftly progressing and increasingly adaptive ai technologies.

2023-05-20

Wenyue Hua, Yingqiang Ge, Shuyuan Xu, Jianchao Ji, Yongfeng Zhang
Abstract: recent advancements in foundation models such as large language models (llm) have propelled them to the forefront of recommender systems (rs). moreover, fairness in rs is critical since many users apply it for decision-making and demand fulfillment. however, at present, there is a lack of understanding regarding the level of fairness exhibited by recommendation foundation models and the appropriate methods for equitably treating different groups of users in foundation models. in this paper, we focus on user-side unfairness problem and show through a thorough examination that there is unfairness involved in llms that lead to unfair recommendation results. to eliminate bias from llm for fairness-aware recommendation, we introduce a novel unbiased p5 (up5) foundation model based on counterfactually-fair-prompting (cfp) techniques. cfp includes two sub-modules: a personalized prefix prompt that enhances fairness with respect to individual sensitive attributes, and a prompt mixture that integrates multiple counterfactually-fair prompts for a set of sensitive attributes. experiments are conducted on two real-world datasets, movielens-1m and insurance, and results are compared with both matching-based and sequential-based fairness-aware recommendation models. the results show that up5 achieves better recommendation performance and meanwhile exhibits a high level of fairness.

2023-05-19

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, Weizhu Chen
Abstract: recent developments in large language models (llms) have been impressive. however, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. inspired by this observation, we introduce a framework called critic that allows llms, which are essentially "black boxes" to validate and progressively amend their own outputs in a manner similar to human interaction with tools. more specifically, starting with an initial output, critic interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that critic consistently enhances the performance of llms. meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of llms.
Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
Abstract: large language models (llms), such as chatgpt, are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. to understand what types of content and to which extent llms are apt to hallucinate, we introduce the hallucination evaluation benchmark for large language models (halueval), a large collection of generated and human-annotated hallucinated samples for evaluating the performance of llms in recognizing hallucination. to generate these samples, we propose a chatgpt-based two-step framework, i.e., sampling-then-filtering. besides, we also hire some human labelers to annotate the hallucinations in chatgpt responses. the empirical results suggest that chatgpt is likely to generate hallucinated content in specific topics by fabricating unverifiable information (i.e., about $19.5\%$ responses). moreover, existing llms face great challenges in recognizing the hallucinations in texts. however, our experiments also prove that providing external knowledge or adding reasoning steps can help llms recognize hallucinations. our benchmark can be accessed at https://github.com/rucaibox/halueval.
Mustafa Safa Ozdayi, Charith Peris, Jack Fitzgerald, Christophe Dupuy, Jimit Majmudar, Haidar Khan, Rahil Parikh, Rahul Gupta
Abstract: large language models (llms) are known to memorize significant portions of their training data. parts of this memorized content have been shown to be extractable by simply querying the model, which poses a privacy risk. we present a novel approach which uses prompt-tuning to control the extraction rates of memorized content in llms. we present two prompt training strategies to increase and decrease extraction rates, which correspond to an attack and a defense, respectively. we demonstrate the effectiveness of our techniques by using models from the gpt-neo family on a public benchmark. for the 1.3b parameter gpt-neo model, our attack yields a 9.3 percentage point increase in extraction rate compared to our baseline. our defense can be tuned to achieve different privacy-utility trade-offs by a user-specified hyperparameter. we achieve an extraction rate reduction of up to 97.7% relative to our baseline, with a perplexity increase of 16.9%.
Hye Sun Yun, Iain J. Marshall, Thomas A. Trikalinos, Byron C. Wallace
Abstract: medical systematic reviews play a vital role in healthcare decision making and policy. however, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. recent advancements in large language models (llms) offer the potential to automatically generate literature reviews on demand, addressing this issue. however, llms sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission. in healthcare, this can make llms unusable at best and dangerous at worst. we conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of llms in the specific context of medical evidence reviews. experts indicated that llms can assist in the writing process by drafting summaries, generating templates, distilling information, and crosschecking information. they also raised concerns regarding confidently composed but inaccurate llm outputs and other potential downstream harms, including decreased accountability and proliferation of low-quality reviews. informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical llms aligned with domain expert views.

2023-05-18

Ning Lu, Shengcai Liu, Rui He, Qi Wang, Ke Tang
Abstract: large language models (llms) have demonstrated exceptional performance in a variety of tasks, including essay writing and question answering. however, it is crucial to address the potential misuse of these models, which can lead to detrimental outcomes such as plagiarism and spamming. recently, several detectors have been proposed, including fine-tuned classifiers and various statistical methods. in this study, we reveal that with the aid of carefully crafted prompts, llms can effectively evade these detection systems. we propose a novel substitution-based in-context example optimization method (sico) to automatically generate such prompts. on three real-world tasks where llms can be misused, sico successfully enables chatgpt to evade six existing detectors, causing a significant 0.54 auc drop on average. surprisingly, in most cases these detectors perform even worse than random classifiers. these results firmly reveal the vulnerability of existing detectors. finally, the strong performance of sico suggests itself as a reliable evaluation protocol for any new detector in this field.
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy
Abstract: large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. we measure the relative importance of these two stages by training lima, a 65b parameter llama language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. lima demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. in a controlled human study, responses from lima are either equivalent or strictly preferred to gpt-4 in 43% of cases; this statistic is as high as 58% when compared to bard and 65% versus davinci003, which was trained with human feedback. taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.
Jiaxu Zhao, Meng Fang, Zijing Shi, Yitong Li, Ling Chen, Mykola Pechenizkiy
Abstract: \textit{\textbf{\textcolor{red}{warning}:} this paper contains content that may be offensive or upsetting.} pretrained conversational agents have been exposed to safety issues, exhibiting a range of stereotypical human biases such as gender bias. however, there are still limited bias categories in current research, and most of them only focus on english. in this paper, we introduce a new chinese dataset, chbias, for bias evaluation and mitigation of chinese conversational language models. apart from those previous well-explored bias categories, chbias includes under-explored bias categories, such as ageism and appearance biases, which received less attention. we evaluate two popular pretrained chinese conversational models, cdial-gpt and eva2.0, using chbias. furthermore, to mitigate different biases, we apply several debiasing methods to the chinese pretrained models. experimental results show that these chinese pretrained models are potentially risky for generating texts that contain social biases, and debiasing methods using the proposed dataset can make response generation less biased while preserving the models' conversational capabilities.
Zhifeng Kong, Kamalika Chaudhuri
Abstract: deep generative models are known to produce undesirable samples such as harmful content. traditional mitigation methods include re-training from scratch, filtering, or editing; however, these are either computationally expensive or can be circumvented by third parties. in this paper, we take a different approach and study how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that will, with high probability, lead to undesirable content. this is done by distilling the conditioning network in the models, giving a solution that is effective, efficient, controllable, and universal for a class of deep generative models. we conduct experiments on redacting prompts in text-to-image models and redacting voices in text-to-speech models. our method is computationally light, leads to better redaction quality and robustness than baseline methods while still retaining high generation quality.
Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yanghao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, Andre Freitas, Mustafa A. Mustafa
Abstract: large language models (llms) have exploded a new heatwave of ai for their ability to engage end-users in human-level conversations with detailed and articulate answers across many knowledge domains. in response to their fast adoption in many industrial applications, this survey concerns their safety and trustworthiness. first, we review known vulnerabilities and limitations of the llms, categorising them into inherent issues, attacks, and unintended bugs. then, we consider if and how the verification and validation (v&v) techniques, which have been widely developed for traditional software and deep learning models such as convolutional neural networks as independent processes to check the alignment of their implementations against the specifications, can be integrated and further extended throughout the lifecycle of the llms to provide rigorous analysis to the safety and trustworthiness of llms and their applications. specifically, we consider four complementary techniques: falsification and evaluation, verification, runtime monitoring, and regulations and ethical use. in total, 370+ references are considered to support the quick understanding of the safety and trustworthiness issues from the perspective of v&v. while intensive research has been conducted to identify the safety and trustworthiness issues, rigorous yet practical methods are called for to ensure the alignment of llms with safety and trustworthiness requirements.

2023-05-17

Anaelia Ovalle, Palash Goyal, Jwala Dhamala, Zachary Jaggers, Kai-Wei Chang, Aram Galstyan, Richard Zemel, Rahul Gupta
Abstract: transgender and non-binary (tgnb) individuals disproportionately experience discrimination and exclusion from daily life. given the recent popularity and adoption of language generation technologies, the potential to further marginalize this population only grows. although a multitude of nlp fairness literature focuses on illuminating and addressing gender biases, assessing gender harms for tgnb identities requires understanding how such identities uniquely interact with societal gender norms and how they differ from gender binary-centric perspectives. such measurement frameworks inherently require centering tgnb voices to help guide the alignment between gender-inclusive nlp and whom they are intended to serve. towards this goal, we ground our work in the tgnb community and existing interdisciplinary literature to assess how the social reality surrounding experienced marginalization of tgnb persons contributes to and persists within open language generation (olg). this social knowledge serves as a guide for evaluating popular large language models (llms) on two key aspects: (1) misgendering and (2) harmful responses to gender disclosure. to do this, we introduce tango, a dataset of template-based real-world text curated from a tgnb-oriented community. we discover a dominance of binary gender norms reflected by the models; llms least misgendered subjects in generated text when triggered by prompts whose subjects used binary pronouns. meanwhile, misgendering was most prevalent when triggering generation with singular they and neopronouns. when prompted with gender disclosures, tgnb disclosure generated the most stigmatizing language and scored most toxic, on average. our findings warrant further research on how tgnb harms manifest in llms and serve as a broader case study toward concretely grounding the design of gender-inclusive ai in community voices and interdisciplinary literature.
Yizhi Liu, Weiguang Wang, Guodong Gordon Gao, Ritu Agarwal
Abstract: electronic health records (ehrs) serve as an essential data source for the envisioned artificial intelligence (ai)-driven transformation in healthcare. however, clinician biases reflected in ehr notes can lead to ai models inheriting and amplifying these biases, perpetuating health disparities. this study investigates the impact of stigmatizing language (sl) in ehr notes on mortality prediction using a transformer-based deep learning model and explainable ai (xai) techniques. our findings demonstrate that sl written by clinicians adversely affects ai performance, particularly so for black patients, highlighting sl as a source of racial disparity in ai model development. to explore an operationally efficient way to mitigate sl's impact, we investigate patterns in the generation of sl through a clinicians' collaborative network, identifying central clinicians as having a stronger impact on racial disparity in the ai model. we find that removing sl written by central clinicians is a more efficient bias reduction strategy than eliminating all sl in the entire corpus of data. this study provides actionable insights for responsible ai development and contributes to understanding clinician behavior and ehr note writing in healthcare.
Shadi Iskander, Kira Radinsky, Yonatan Belinkov
Abstract: natural language processing models tend to learn and encode social biases present in the data. one popular approach for addressing such biases is to eliminate encoded information from the model's representations. however, current methods are restricted to removing only linearly encoded information. in this work, we propose iterative gradient-based projection (igbp), a novel method for removing non-linear encoded concepts from neural representations. our method consists of iteratively training neural classifiers to predict a particular attribute we seek to eliminate, followed by a projection of the representation on a hypersurface, such that the classifiers become oblivious to the target attribute. we evaluate the effectiveness of our method on the task of removing gender and race information as sensitive attributes. our results demonstrate that igbp is effective in mitigating bias through intrinsic and extrinsic evaluations, with minimal impact on downstream task accuracy.
Nam Ho Koh, Joseph Plata, Joyce Chai
Abstract: application tracking systems (ats) have allowed talent managers, recruiters, and college admissions committees to process large volumes of potential candidate applications efficiently. traditionally, this screening process was conducted manually, creating major bottlenecks due to the quantity of applications and introducing many instances of human bias. the advent of large language models (llms) such as chatgpt and the potential of adopting methods to current automated application screening raises additional bias and fairness issues that must be addressed. in this project, we wish to identify and quantify the instances of social bias in chatgpt and other openai llms in the context of candidate screening in order to demonstrate how the use of these models could perpetuate existing biases and inequalities in the hiring process.
Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, Peter J. Liu
Abstract: learning from human feedback has been shown to be effective at aligning language models with human preferences. past work has often relied on reinforcement learning from human feedback (rlhf), which optimizes the language model using reward scores assigned from a reward model trained on human preference data. in this work we show how the recently introduced sequence likelihood calibration (slic), can also be used to effectively learn from human preferences (slic-hf). furthermore, we demonstrate this can be done with human feedback data collected for a different model, similar to off-policy, offline rl data. automatic and human evaluation experiments on the tl;dr summarization task show that slic-hf significantly improves supervised fine-tuning baselines. furthermore, slic-hf presents a competitive alternative to the ppo rlhf implementation used in past work while being much simpler to implement, easier to tune and more computationally efficient in practice.
Sourojit Ghosh, Aylin Caliskan
Abstract: in this multicultural age, language translation is one of the most performed tasks, and it is becoming increasingly ai-moderated and automated. as a novel ai system, chatgpt claims to be proficient in such translation tasks and in this paper, we put that claim to the test. specifically, we examine chatgpt's accuracy in translating between english and languages that exclusively use gender-neutral pronouns. we center this study around bengali, the 7$^{th}$ most spoken language globally, but also generalize our findings across five other languages: farsi, malay, tagalog, thai, and turkish. we find that chatgpt perpetuates gender defaults and stereotypes assigned to certain occupations (e.g. man = doctor, woman = nurse) or actions (e.g. woman = cook, man = go to work), as it converts gender-neutral pronouns in languages to `he' or `she'. we also observe chatgpt completely failing to translate the english gender-neutral pronoun `they' into equivalent gender-neutral pronouns in other languages, as it produces translations that are incoherent and incorrect. while it does respect and provide appropriately gender-marked versions of bengali words when prompted with gender information in english, chatgpt appears to confer a higher respect to men than to women in the same occupation. we conclude that chatgpt exhibits the same gender biases which have been demonstrated for tools like google translate or ms translator, as we provide recommendations for a human centered approach for future designers of ais that perform language translation to better accommodate such low-resource languages.
Sarthak Ahuja, Mohammad Kachuee, Fateme Sheikholeslami, Weiqing Liu, Jaeyoung Do
Abstract: off-policy reinforcement learning has been a driving force for the state-of-the-art conversational ais leading to more natural humanagent interactions and improving the user satisfaction for goal-oriented agents. however, in large-scale commercial settings, it is often challenging to balance between policy improvements and experience continuity on the broad spectrum of applications handled by such system. in the literature, off-policy evaluation and guard-railing on aggregate statistics has been commonly used to address this problem. in this paper, we propose a method for curating and leveraging high-precision samples sourced from historical regression incident reports to validate, safe-guard, and improve policies prior to the online deployment. we conducted extensive experiments using data from a real-world conversational system and actual regression incidents. the proposed method is currently deployed in our production system to protect customers against broken experiences and enable long-term policy improvements.
Jianlong Zhou, Heimo Müller, Andreas Holzinger, Fang Chen
Abstract: large language models, e.g. chatgpt are currently contributing enormously to make artificial intelligence even more popular, especially among the general population. however, such chatbot models were developed as tools to support natural language communication between humans. problematically, it is very much a ``statistical correlation machine" (correlation instead of causality) and there are indeed ethical concerns associated with the use of ai language models such as chatgpt, such as bias, privacy, and abuse. this paper highlights specific ethical concerns on chatgpt and articulates key challenges when chatgpt is used in various applications. practical commandments for different stakeholders of chatgpt are also proposed that can serve as checklist guidelines for those applying chatgpt in their applications. these commandment examples are expected to motivate the ethical use of chatgpt.

2023-05-16

Fatma Elsafoury, Gavin Abercrombie
Abstract: in this paper, we trace the biases in current natural language processing (nlp) models back to their origins in racism, sexism, and homophobia over the last 500 years. we review literature from critical race theory, gender studies, data ethics, and digital humanities studies, and summarize the origins of bias in nlp models from these social science perspective. we show how the causes of the biases in the nlp pipeline are rooted in social issues. finally, we argue that the only way to fix the bias and unfairness in nlp is by addressing the social problems that caused them in the first place and by incorporating social sciences and social scientists in efforts to mitigate bias in nlp models. we provide actionable recommendations for the nlp research community to do so.
Jiaming Ji, Jiayi Zhou, Borong Zhang, Juntao Dai, Xuehai Pan, Ruiyang Sun, Weidong Huang, Yiran Geng, Mickel Liu, Yaodong Yang
Abstract: ai systems empowered by reinforcement learning (rl) algorithms harbor the immense potential to catalyze societal advancement, yet their deployment is often impeded by significant safety concerns. particularly in safety-critical applications, researchers have raised concerns about unintended harms or unsafe behaviors of unaligned rl agents. the philosophy of safe reinforcement learning (saferl) is to align rl agents with harmless intentions and safe behavioral patterns. in saferl, agents learn to develop optimal policies by receiving feedback from the environment, while also fulfilling the requirement of minimizing the risk of unintended harm or unsafe behavior. however, due to the intricate nature of saferl algorithm implementation, combining methodologies across various domains presents a formidable challenge. this had led to an absence of a cohesive and efficacious learning framework within the contemporary saferl research milieu. in this work, we introduce a foundational framework designed to expedite saferl research endeavors. our comprehensive framework encompasses an array of algorithms spanning different rl domains and places heavy emphasis on safety elements. our efforts are to make the saferl-related research process more streamlined and efficient, therefore facilitating further research in ai safety. our project is released at: https://github.com/pku-alignment/omnisafe.
Wei Du, Peixuan Li, Boqun Li, Haodong Zhao, Gongshen Liu
Abstract: backdoors implanted in pre-trained language models (plms) can be transferred to various downstream tasks, which exposes a severe security threat. however, most existing backdoor attacks against plms are un-targeted and task-specific. few targeted and task-agnostic methods use manually pre-defined triggers and output representations, which prevent the attacks from being more effective and general. in this paper, we first summarize the requirements that a more threatening backdoor attack against plms should satisfy, and then propose a new backdoor attack method called uor, which breaks the bottleneck of the previous approach by turning manual selection into automatic optimization. specifically, we define poisoned supervised contrastive learning which can automatically learn the more uniform and universal output representations of triggers for various plms. moreover, we use gradient search to select appropriate trigger words which can be adaptive to different plms and vocabularies. experiments show that our method can achieve better attack performance on various text classification tasks compared to manual methods. further, we tested our method on plms with different architectures, different usage paradigms, and more difficult tasks, which demonstrated the universality of our method.
Hans W. A. Hanley, Zakir Durumeric
Abstract: as large language models (llms) like chatgpt have gained traction, an increasing number of news websites have begun utilizing them to generate articles. however, not only can these language models produce factually inaccurate articles on reputable websites but disreputable news sites can utilize llms to mass produce misinformation. to begin to understand this phenomenon, we present one of the first large-scale studies of the prevalence of synthetic articles within online news media. to do this, we train a deberta-based synthetic news detector and classify over 15.90 million articles from 3,074~misinformation and mainstream news websites. we find that between january 1, 2022, and may 1, 2023, the relative number of synthetic news articles increased by 61.1% on mainstream websites while increasing by 426% on misinformation sites. we find that this increase is largely driven by smaller less popular websites. analyzing the impact of the release of chatgpt using an interrupted-time-series, we show that while its release resulted in a marked increase in synthetic articles on small sites as well as misinformation news websites, there was not a corresponding increase on large mainstream news websites.

2023-05-15

Zhengxuan Wu, Atticus Geiger, Christopher Potts, Noah D. Goodman
Abstract: obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for ai safety. however, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. distributed alignment search (das) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. in the present paper, we scale das significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call das. this enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. we apply das to the alpaca model (7b parameters), which, off the shelf, solves a simple numerical reasoning problem. with das, we discover that alpaca does this by implementing a causal model with two interpretable boolean variables. furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. these findings mark a first step toward deeply understanding the inner-workings of our largest and most widely deployed language models.
Wentao Ye, Mingfeng Ou, Tianyi Li, Yipeng Chen, Xuetao Ma, Yifan Yanggong, Sai Wu, Jie Fu, Gang Chen, Haobo Wang, Junbo Zhao
Abstract: the recent popularity of large language models (llms) has brought a significant impact to boundless fields, particularly through their open-ended ecosystem such as the apis, open-sourced models, and plugins. however, with their widespread deployment, there is a general lack of research that thoroughly discusses and analyzes the potential risks concealed. in that case, we intend to conduct a preliminary but pioneering study covering the robustness, consistency, and credibility of llms systems. with most of the related literature in the era of llm uncharted, we propose an automated workflow that copes with an upscaled number of queries/responses. overall, we conduct over a million queries to the mainstream llms including chatgpt, llama, and opt. core to our workflow consists of a data primitive, followed by an automated interpreter that evaluates these llms under different adversarial metrical systems. as a result, we draw several, and perhaps unfortunate, conclusions that are quite uncommon from this trendy community. briefly, they are: (i)-the minor but inevitable error occurrence in the user-generated query input may, by chance, cause the llm to respond unexpectedly; (ii)-llms possess poor consistency when processing semantically similar query input. in addition, as a side finding, we find that chatgpt is still capable to yield the correct answer even when the input is polluted at an extreme level. while this phenomenon demonstrates the powerful memorization of the llms, it raises serious concerns about using such data for llm-involved evaluation in academic development. to deal with it, we propose a novel index associated with a dataset that roughly decides the feasibility of using such data for llm-involved evaluation. extensive empirical studies are tagged to support the aforementioned claims.
Samuel Stevens, Yu Su
Abstract: over-parameterized neural language models (lms) can memorize and recite long sequences of training data. while such memorization is normally associated with undesired properties such as overfitting and information leaking, our work casts memorization as an unexplored capability of lms. we propose the first symmetric encryption algorithm with autoregressive language models (selm). we show that autoregressive lms can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e., decryption) via random subspace optimization and greedy decoding. while selm is not amenable to conventional cryptanalysis, we investigate its security through a novel empirical variant of the classic ind-cpa (indistinguishability under chosen-plaintext attack) game and show promising results on security. our code and datasets are available at https://github.com/osu-nlp-group/selm.

2023-05-14

Shangbin Feng, Chan Young Park, Yuhan Liu, Yulia Tsvetkov
Abstract: language models (lms) are pretrained on diverse data sources, including news, discussion forums, books, and online encyclopedias. a significant portion of this data includes opinions and perspectives which, on one hand, celebrate democracy and diversity of ideas, and on the other hand are inherently socially biased. our work develops new methods to (1) measure political biases in lms trained on such corpora, along social and economic axes, and (2) measure the fairness of downstream nlp models trained on top of politically biased lms. we focus on hate speech and misinformation detection, aiming to empirically quantify the effects of political (social, economic) biases in pretraining data on the fairness of high-stakes social-oriented tasks. our findings reveal that pretrained lms do have political leanings that reinforce the polarization present in pretraining corpora, propagating social biases into hate speech predictions and misinformation detectors. we discuss the implications of our findings for nlp research and propose future directions to mitigate unfairness.
Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, Nenghai Yu
Abstract: llms now exhibit human-like skills in various fields, leading to worries about misuse. thus, detecting generated text is crucial. however, passive detection methods are stuck in domain specificity and limited adversarial robustness. to achieve reliable detection, a watermark-based method was proposed for white-box llms, allowing them to embed watermarks during text generation. the method involves randomly dividing the model vocabulary to obtain a special list and adjusting the probability distribution to promote the selection of words in the list. a detection algorithm aware of the list can identify the watermarked text. however, this method is not applicable in many real-world scenarios where only black-box language models are available. for instance, third-parties that develop api-based vertical applications cannot watermark text themselves because api providers only supply generated text and withhold probability distributions to shield their commercial interests. to allow third-parties to autonomously inject watermarks into generated text, we develop a watermarking framework for black-box language model usage scenarios. specifically, we first define a binary encoding function to compute a random binary encoding corresponding to a word. the encodings computed for non-watermarked text conform to a bernoulli distribution, wherein the probability of a word representing bit-1 being approximately 0.5. to inject a watermark, we alter the distribution by selectively replacing words representing bit-0 with context-based synonyms that represent bit-1. a statistical test is then used to identify the watermark. experiments demonstrate the effectiveness of our method on both chinese and english datasets. furthermore, results under re-translation, polishing, word deletion, and synonym substitution attacks reveal that it is arduous to remove the watermark without compromising the original semantics.

2023-05-13

Alexei Grinbaum, Laurynas Adomaitis
Abstract: we suggest the implementation of the dual use research of concern (durc) framework, originally designed for life sciences, to the domain of generative ai, with a specific focus on large language models (llms). with its demonstrated advantages and drawbacks in biological research, we believe the durc criteria can be effectively redefined for llms, potentially contributing to improved ai governance. acknowledging the balance that must be struck when employing the durc framework, we highlight its crucial political role in enhancing societal awareness of the impact of generative ai. as a final point, we offer a series of specific recommendations for applying the durc approach to llm research.
Steve Phelps, Yvan I. Russell
Abstract: in this study, we investigate the capacity of large language models (llms), specifically gpt-3.5, to operationalise natural language descriptions of cooperative, competitive, altruistic, and self-interested behavior in social dilemmas. our focus is on the iterated prisoner's dilemma, a classic example of a non-zero-sum interaction, but our broader research program encompasses a range of experimental economics scenarios, including the ultimatum game, dictator game, and public goods game. using a within-subject experimental design, we instantiated llm-generated agents with various prompts that conveyed different cooperative and competitive stances. we then assessed the agents' level of cooperation in the iterated prisoner's dilemma, taking into account their responsiveness to the cooperative or defection actions of their partners. our results provide evidence that llms can translate natural language descriptions of altruism and selfishness into appropriate behaviour to some extent, but exhibit limitations in adapting their behavior based on conditioned reciprocity. the observed pattern of increased cooperation with defectors and decreased cooperation with cooperators highlights potential constraints in the llm's ability to generalize its knowledge about human behavior in social dilemmas. we call upon the research community to further explore the factors contributing to the emergent behavior of llm-generated agents in a wider array of social dilemmas, examining the impact of model architecture, training parameters, and various partner strategies on agent behavior. as more advanced llms like gpt-4 become available, it is crucial to investigate whether they exhibit similar limitations or are capable of more nuanced cooperative behaviors, ultimately fostering the development of ai systems that better align with human values and social norms.
Kung-Hsiang Huang, Hou Pong Chan, Heng Ji
Abstract: faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models. drawing on humans' ability to identify and correct factual errors, we present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence. our zero-shot framework outperforms fully-supervised approaches, as demonstrated by experiments on the fever and scifact datasets, where our outputs are shown to be more faithful. more importantly, the decomposability nature of our framework inherently provides interpretability. additionally, to reveal the most suitable metrics for evaluating factual error corrections, we analyze the correlation between commonly used metrics with human judgments in terms of three different dimensions regarding intelligibility and faithfulness.
Erik Derner, Kristina Batistič
Abstract: the increasing popularity of large language models (llms) such as chatgpt has led to growing concerns about their safety, security risks, and ethical implications. this paper aims to provide an overview of the different types of security risks associated with chatgpt, including malicious text and code generation, private data disclosure, fraudulent services, information gathering, and producing unethical content. we present an empirical study examining the effectiveness of chatgpt's content filters and explore potential ways to bypass these safeguards, demonstrating the ethical implications and security risks that persist in llms even when protections are in place. based on a qualitative analysis of the security implications, we discuss potential strategies to mitigate these risks and inform researchers, policymakers, and industry professionals about the complex security challenges posed by llms like chatgpt. this study contributes to the ongoing discussion on the ethical and security implications of llms, underscoring the need for continued research in this area.

2023-05-12

Gal Yona, Or Honovich, Itay Laish, Roee Aharoni
Abstract: ensuring that large language models (lms) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. in the context of open-text generation tasks, however, such an evaluation is not trivial. for example, when introducing a model with an input text and a perturbed, "contrastive" version of it, meaningful differences in the next-token predictions may not be revealed with standard decoding strategies. with this motivation in mind, we propose contrastive input decoding (cid): a decoding algorithm to generate text given two inputs, where the generated text is likely given one input but unlikely given the other. in this way, the contrastive generations can highlight potentially subtle differences in how the lm output differs for the two inputs in a simple and interpretable manner. we use cid to highlight context-specific biases that are hard to detect with standard decoding strategies and quantify the effect of different input perturbations.
Jizhi Zhang, Keqin Bao, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He
Abstract: the remarkable achievements of large language models (llms) have led to the emergence of a novel recommendation paradigm -- recommendation via llm (recllm). nevertheless, it is important to note that llms may contain social prejudices, and therefore, the fairness of recommendations made by recllm requires further investigation. to avoid the potential risks of recllm, it is imperative to evaluate the fairness of recllm with respect to various sensitive attributes on the user side. due to the differences between the recllm paradigm and the traditional recommendation paradigm, it is problematic to directly use the fairness benchmark of traditional recommendation. to address the dilemma, we propose a novel benchmark called fairness of recommendation via llm (fairllm). this benchmark comprises carefully crafted metrics and a dataset that accounts for eight sensitive attributes1 in two recommendation scenarios: music and movies. by utilizing our fairllm benchmark, we conducted an evaluation of chatgpt and discovered that it still exhibits unfairness to some sensitive attributes when generating recommendations. our code and dataset can be found at https://github.com/jizhi-zhang/fairllm.
Christopher M. Ormerod, Milan Patel, Harry Wang
Abstract: this article details the advances made to a system that uses artificial intelligence to identify alarming student responses. this system is built into our assessment platform to assess whether a student's response indicates they are a threat to themselves or others. such responses may include details concerning threats of violence, severe depression, suicide risks, and descriptions of abuse. driven by advances in natural language processing, the latest model is a fine-tuned language model trained on a large corpus consisting of student responses and supplementary texts. we demonstrate that the use of a language model delivers a substantial improvement in accuracy over the previous iterations of this system.

2023-05-11

Camilla Quaresmini, Giuseppe Primiero
Abstract: ai systems are not intrinsically neutral and biases trickle in any type of technological tool. in particular when dealing with people, ai algorithms reflect technical errors originating with mislabeled data. as they feed wrong and discriminatory classifications, perpetuating structural racism and marginalization, these systems are not systematically guarded against bias. in this article we consider the problem of bias in ai systems from the point of view of information quality dimensions. we illustrate potential improvements of a bias mitigation tool in gender classification errors, referring to two typically difficult contexts: the classification of non-binary individuals and the classification of transgender individuals. the identification of data quality dimensions to implement in bias mitigation tool may help achieve more fairness. hence, we propose to consider this issue in terms of completeness, consistency, timeliness and reliability, and offer some theoretical results.
Julian Hazell
Abstract: recent progress in artificial intelligence (ai), particularly in the domain of large language models (llms), has resulted in powerful and versatile dual-use systems. indeed, cognition can be put towards a wide variety of tasks, some of which can result in harm. this study investigates how llms can be used for spear phishing, a form of cybercrime that involves manipulating targets into divulging sensitive information. i first explore llms' ability to assist with the reconnaissance and message generation stages of a successful spear phishing attack, where i find that advanced llms are capable of improving cybercriminals' efficiency during these stages. to explore how llms can be used to scale spear phishing campaigns, i then create unique spear phishing messages for over 600 british members of parliament using openai's gpt-3.5 and gpt-4 models. my findings reveal that these messages are not only realistic but also cost-effective, with each email costing only a fraction of a cent to generate. next, i demonstrate how basic prompt engineering can circumvent safeguards installed in llms by the reinforcement learning from human feedback fine-tuning process, highlighting the need for more robust governance interventions aimed at preventing misuse. to address these evolving risks, i propose two potential solutions: structured access schemes, such as application programming interfaces, and llm-based defensive systems.
Huriyyah Althunayan, Rahaf Bahlas, Manar Alharbi, Lena Alsuwailem, Abeer Aldayel, Rehab Alahmadi
Abstract: toxic language is difficult to define, as it is not monolithic and has many variations in perceptions of toxicity. this challenge of detecting toxic language is increased by the highly contextual and subjectivity of its interpretation, which can degrade the reliability of datasets and negatively affect detection model performance. to fill this void, this paper introduces a toxicity inspector framework that incorporates a human-in-the-loop pipeline with the aim of enhancing the reliability of toxicity benchmark datasets by centering the evaluator's values through an iterative feedback cycle. the centerpiece of this framework is the iterative feedback process, which is guided by two metric types (hard and soft) that provide evaluators and dataset creators with insightful examination to balance the tradeoff between performance gains and toxicity avoidance.

2023-05-10

Hong Wang, Xuan Luo, Weizhi Wang, Xifeng Yan
Abstract: large language models like chatgpt have recently demonstrated impressive capabilities in natural language understanding and generation, enabling various applications including translation, essay writing, and chit-chatting. however, there is a concern that they can be misused for malicious purposes, such as fraud or denial-of-service attacks. therefore, it is crucial to develop methods for detecting whether the party involved in a conversation is a bot or a human. in this paper, we propose a framework named flair, finding large language model authenticity via a single inquiry and response, to detect conversational bots in an online manner. specifically, we target a single question scenario that can effectively differentiate human users from bots. the questions are divided into two categories: those that are easy for humans but difficult for bots (e.g., counting, substitution, positioning, noise filtering, and ascii art), and those that are easy for bots but difficult for humans (e.g., memorization and computation). our approach shows different strengths of these questions in their effectiveness, providing a new way for online service providers to protect themselves against nefarious activities and ensure that they are serving real users. we open-sourced our dataset on https://github.com/hongwang600/flair and welcome contributions from the community to enrich such detection datasets.

2023-05-09

Travis Munyer, Xin Zhong
Abstract: the capabilities of text generators have grown with the rapid development of large language models (llm). to prevent potential misuse, the ability to detect whether texts are produced by llm has become increasingly important. several related works have attempted to solve this problem using binary classifiers that categorize input text as human-written or llm-generated. however, these classifiers have been shown to be unreliable. as impactful decisions could be made based on the result of the classification, the text source detection needs to be high-quality. to this end, this paper presents deeptextmark, a deep learning-based text watermarking method for text source detection. applying word2vec and sentence encoding for watermark insertion and a transformer-based classifier for watermark detection, deeptextmark achieves blindness, robustness, imperceptibility, and reliability simultaneously. as discussed further in the paper, these traits are indispensable for generic text source detection, and the application focus of this paper is on the text generated by llm. deeptextmark can be implemented as an "add-on" to existing text generation systems. that is, the method does not require access or modification to the text generation technique. experiments have shown high imperceptibility, high detection accuracy, enhanced robustness, reliability, and fast running speed of deeptextmark.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro Von Werra, Harm De Vries
Abstract: the bigcode community, an open-scientific collaboration working on the responsible development of large language models for code (code llms), introduces starcoder and starcoderbase: 15.5b parameter models with 8k context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. starcoderbase is trained on 1 trillion tokens sourced from the stack, a large collection of permissively licensed github repositories with inspection tools and an opt-out process. we fine-tuned starcoderbase on 35b python tokens, resulting in the creation of starcoder. we perform the most comprehensive evaluation of code llms to date and show that starcoderbase outperforms every open code llm that supports multiple programming languages and matches or outperforms the openai code-cushman-001 model. furthermore, starcoder outperforms every model that is fine-tuned on python, can be prompted to achieve 40\% pass@1 on humaneval, and still retains its performance on other programming languages. we take several important steps towards a safe open-access model release, including an improved pii redaction pipeline and a novel attribution tracing tool, and make the starcoder models publicly available under a more commercially viable version of the open responsible ai model license.
Charmaine Barker, Dimitar Kazakov
Abstract: the presence of specific linguistic signals particular to a certain sub-group of people can be picked up by language models during training. if the model begins to associate specific language with a distinct group, any decisions made based upon this language would hold a strong correlation to a decision based upon their protected characteristic, leading to possible discrimination. we explore a potential technique for bias mitigation in the form of simplification of text. the driving force of this idea is that simplifying text should standardise language between different sub-groups to one way of speaking while keeping the same meaning. the experiment shows promising results as the classifier accuracy for predicting the sensitive attribute drops by up to 17% for the simplified data.
Zhang Ze Yu, Lau Jia Jaw, Wong Qin Jiang, Zhang Hui, Bryan Kian Hsiang Low
Abstract: reinforcement learning with human feedback (rlhf) has been demonstrated to significantly enhance the performance of large language models (llms) by aligning their outputs with desired human values through instruction tuning. however, rlhf is constrained by the expertise and productivity limitations of human evaluators. a response to this downside is to fall back to supervised fine-tuning (sft) with additional carefully selected expert demonstrations. however, while this method has been proven to be effective, it invariably also leads to increased human-in-the-loop overhead. in this study, we propose another alternative approach: reinforcement learning with generative adversarial feedback (rlgaf) to rlhf and sft, which uses a generative adversarial training style to enable the llms to learn useful human expert demonstrations without being directly exposed to the training examples, thus enabling good generalization capabilities while preserving sample efficiency. our preliminary findings indicate that rlgaf can help align llms outputs with competitive performance against rlhf and sft, while not suffering from their respective inherent restrictions, suggesting promising avenues for further research on automating ai alignment.

2023-05-08

Zhiyuan Zhang, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
Abstract: pre-trained language models (plms) may be poisonous with backdoors or bias injected by the suspicious attacker during the fine-tuning process. a core challenge of purifying potentially poisonous plms is precisely finding poisonous dimensions. to settle this issue, we propose the fine-purifying approach, which utilizes the diffusion theory to study the dynamic process of fine-tuning for finding potentially poisonous dimensions. according to the relationship between parameter drifts and hessians of different dimensions, we can detect poisonous dimensions with abnormal dynamics, purify them by resetting them to clean pre-trained weights, and then fine-tune the purified weights on a small clean dataset. to the best of our knowledge, we are the first to study the dynamics guided by the diffusion theory for safety or defense purposes. experimental results validate the effectiveness of fine-purifying even with a small clean dataset.
Ning Bian, Hongyu Lin, Peilin Liu, Yaojie Lu, Chunkang Zhang, Ben He, Xianpei Han, Le Sun
Abstract: social cognitive theory explains how people learn and acquire knowledge through observing others. recent years have witnessed the rapid development of large language models (llms), which suggests their potential significance as agents in the society. llms, as ai agents, can observe external information, which shapes their cognition and behaviors. however, the extent to which external information influences llms' cognition and behaviors remains unclear. this study investigates how external statements and opinions influence llms' thoughts and behaviors from a social cognitive perspective. three experiments were conducted to explore the effects of external information on llms' memories, opinions, and social media behavioral decisions. sociocognitive factors, including source authority, social identity, and social role, were analyzed to investigate their moderating effects. results showed that external information can significantly shape llms' memories, opinions, and behaviors, with these changes mirroring human social cognitive patterns such as authority bias, in-group bias, emotional positivity, and emotion contagion. this underscores the challenges in developing safe and unbiased llms, and emphasizes the importance of understanding the susceptibility of llms to external influences.
Tamás Vörös, Sean Paul Bergeron, Konstantin Berlin
Abstract: we introduce a state-of-the-art approach for url categorization that leverages the power of large language models (llms) to address the primary objectives of web content filtering: safeguarding organizations from legal and ethical risks, limiting access to high-risk or suspicious websites, and fostering a secure and professional work environment. our method utilizes llms to generate accurate classifications and then employs established knowledge distillation techniques to create smaller, more specialized student models tailored for web content filtering. distillation results in a student model with a 9% accuracy rate improvement in classifying websites, sourced from customer telemetry data collected by a large security vendor, into 30 distinct content categories based on their urls, surpassing the current state-of-the-art approach. our student model matches the performance of the teacher llm with 175 times less parameters, allowing the model to be used for in-line scanning of large volumes of urls, and requires 3 orders of magnitude less manually labeled training data than the current state-of-the-art approach. depending on the specific use case, the output generated by our approach can either be directly returned or employed as a pre-filter for more resource-intensive operations involving website images or html.
Sayak Saha Roy, Krishna Vamsi Naragam, Shirin Nilizadeh
Abstract: the ability of chatgpt to generate human-like responses and understand context has made it a popular tool for conversational agents, content creation, data analysis, and research and innovation. however, its effectiveness and ease of accessibility makes it a prime target for generating malicious content, such as phishing attacks, that can put users at risk. in this work, we identify several malicious prompts that can be provided to chatgpt to generate functional phishing websites. through an iterative approach, we find that these phishing websites can be made to imitate popular brands and emulate several evasive tactics that have been known to avoid detection by anti-phishing entities. these attacks can be generated using vanilla chatgpt without the need of any prior adversarial exploits (jailbreaking).

2023-05-07

Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
Abstract: large language models (llms) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (cot). it is tempting to interpret these cot explanations as the llm's process for solving a task. however, we find that cot explanations can systematically misrepresent the true reason for a model's prediction. we demonstrate that cot explanations can be heavily influenced by adding biasing features to model inputs -- e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always "(a)" -- which models systematically fail to mention in their explanations. when we bias models toward incorrect answers, they frequently generate cot explanations supporting those answers. this causes accuracy to drop by as much as 36% on a suite of 13 tasks from big-bench hard, when testing with gpt-3.5 from openai and claude 1.0 from anthropic. on a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. our findings indicate that cot explanations can be plausible yet misleading, which risks increasing our trust in llms without guaranteeing their safety. cot is promising for explainability, but our results highlight the need for targeted efforts to evaluate and improve explanation faithfulness.

2023-05-04

Pingchuan Ma, Zongjie Li, Ao Sun, Shuai Wang
Abstract: as the popularity of large language models (llms) soars across various applications, ensuring their alignment with human values has become a paramount concern. in particular, given that llms have great potential to serve as general-purpose ai assistants in daily life, their subtly unethical suggestions become a serious and real concern. tackling the challenge of automatically testing and repairing unethical suggestions is thus demanding. this paper introduces the first framework for testing and repairing unethical suggestions made by llms. we first propose ethicssuite, a test suite that presents complex, contextualized, and realistic moral scenarios to test llms. we then propose a novel suggest-critic-reflect (scr) process, serving as an automated test oracle to detect unethical suggestions. we recast deciding if llms yield unethical suggestions (a hard problem; often requiring human expertise and costly to decide) into a pcr task that can be automatically checked for violation. moreover, we propose a novel on-the-fly (otf) repairing scheme that repairs unethical suggestions made by llms in real-time. the otf scheme is applicable to llms in a black-box api setting with moderate cost. with ethicssuite, our study on seven popular llms (e.g., chatgpt, gpt-4) uncovers in total 109,824 unethical suggestions. we apply our otf scheme on two llms (llama-13b and chatgpt), which generates valid repair to a considerable amount of unethical ones, paving the way for more ethically conscious llms.
Nardine Osman, "Mark D'Inverno"
Abstract: one of the major challenges we face with ethical ai today is developing computational systems whose reasoning and behaviour are provably aligned with human values. human values, however, are notorious for being ambiguous, contradictory and ever-changing. in order to bridge this gap, and get us closer to the situation where we can formally reason about implementing values into ai, this paper presents a formal representation of values, grounded in the social sciences. we use this formal representation to articulate the key challenges for achieving value-aligned behaviour in multiagent systems (mas) and a research roadmap for addressing them.
Nardine Osman, "Mark D'Inverno"
Abstract: in the diverse array of work investigating the nature of human values from psychology, philosophy and social sciences, there is a clear consensus that values guide behaviour. more recently, a recognition that values provide a means to engineer ethical ai has emerged. indeed, stuart russell proposed shifting ai's focus away from simply ``intelligence'' towards intelligence ``provably aligned with human values''. this challenge -- the value alignment problem -- with others including an ai's learning of human values, aggregating individual values to groups, and designing computational mechanisms to reason over values, has energised a sustained research effort. despite this, no formal, computational definition of values has yet been proposed. we address this through a formal conceptual framework rooted in the social sciences, that provides a foundation for the systematic, integrated and interdisciplinary investigation into how human values can support designing ethical ai.
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan
Abstract: recent ai-assistant agents, such as chatgpt, predominantly rely on supervised fine-tuning (sft) with human annotations and reinforcement learning from human feedback (rlhf) to align the output of large language models (llms) with human intentions, ensuring they are helpful, ethical, and reliable. however, this dependence can significantly constrain the true potential of ai-assistant agents due to the high cost of obtaining human supervision and the related issues on quality, reliability, diversity, self-consistency, and undesirable biases. to address these challenges, we propose a novel approach called self-align, which combines principle-driven reasoning and the generative power of llms for the self-alignment of ai agents with minimal human supervision. our approach encompasses four stages: first, we use an llm to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for ai models to follow, and guide the llm through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to user's queries; third, we fine-tune the original llm with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. applying self-align to the llama-65b base language model, we develop an ai assistant named dromedary. with fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning). dromedary significantly surpasses the performance of several state-of-the-art ai systems, including text-davinci-003 and alpaca, on benchmark datasets with various settings.
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, Michael Zeng
Abstract: large language models (llms) have shown impressive performance as general purpose agents, but their abilities remain highly dependent on prompts which are hand written with onerous trial-and-error effort. we propose a simple and nonparametric solution to this problem, automatic prompt optimization (apo), which is inspired by numerical gradient descent to automatically improve prompts, assuming access to training data and an llm api. the algorithm uses minibatches of data to form natural language "gradients" that criticize the current prompt. the gradients are then "propagated" into the prompt by editing the prompt in the opposite semantic direction of the gradient. these gradient descent steps are guided by a beam search and bandit selection procedure which significantly improves algorithmic efficiency. preliminary results across three benchmark nlp tasks and the novel problem of llm jailbreak detection suggest that automatic prompt optimization can outperform prior prompt editing techniques and improve an initial prompt's performance by up to 31%, by using data to rewrite vague task descriptions into more precise annotation instructions.

2023-05-02

Aly M. Kassem
Abstract: large language models (llms) are trained on large amounts of data, which can include sensitive information that may compromise personal privacy. llms showed to memorize parts of the training data and emit those data verbatim when an adversary prompts appropriately. previous research has primarily focused on data preprocessing and differential privacy techniques to address memorization or prevent verbatim memorization exclusively, which can give a false sense of privacy. however, these methods rely on explicit and implicit assumptions about the structure of the data to be protected, which often results in an incomplete solution to the problem. to address this, we propose a novel framework that utilizes a reinforcement learning approach (ppo) to fine-tune llms to mitigate approximate memorization. our approach utilizes a negative similarity score, such as bertscore or sacrebleu, as a reward signal to learn a dissimilarity policy. our results demonstrate that this framework effectively mitigates approximate memorization while maintaining high levels of coherence and fluency in the generated samples. furthermore, our framework is robust in mitigating approximate memorization across various circumstances, including longer context, which is known to increase memorization in llms.
Giwon Hong, Jeonghwan Kim, Junmo Kang, Sung-Hyon Myaeng, Joyce Jiyoung Whang
Abstract: most existing retrieval-augmented language models (lms) for question answering assume all retrieved information is factually correct. in this work, we study a more realistic scenario in which retrieved documents may contain misinformation, causing conflicts among them. we observe that the existing models are highly brittle to such information in both fine-tuning and in-context few-shot learning settings. we propose approaches to make retrieval-augmented lms robust to misinformation by explicitly fine-tuning a discriminator or prompting to elicit discrimination capability in gpt-3. our empirical results on open-domain question answering show that these approaches significantly improve lms' robustness to knowledge conflicts. we also provide our findings on interleaving the fine-tuned model's decision with the in-context learning process, paving a new path to leverage the best of both worlds.

2023-05-01

Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
Abstract: instruction-tuned lms such as chatgpt, flan, and instructgpt are finetuned on datasets that contain user-submitted examples, e.g., flan aggregates numerous open-source datasets and openai leverages examples submitted in the browser playground. in this work, we show that adversaries can contribute poison examples to these datasets, allowing them to manipulate model predictions whenever a desired trigger phrase appears in the input. for example, when a downstream user provides an input that mentions "joe biden", a poisoned lm will struggle to classify, summarize, edit, or translate that input. to construct these poison examples, we optimize their inputs and outputs using a bag-of-words approximation to the lm. we evaluate our method on open-source instruction-tuned lms. by using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. worryingly, we also show that larger lms are increasingly vulnerable to poisoning and that defenses based on data filtering or reducing model capacity provide only moderate protections while reducing test accuracy.
Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José G. C. De Souza, Shuyan Zhou, Tongshuang Wu, Graham Neubig, André F. T. Martins
Abstract: many recent advances in natural language generation have been fueled by training large language models on internet-scale data. however, this paradigm can lead to models that generate toxic, inaccurate, and unhelpful content, and automatic evaluation metrics often fail to identify these behaviors. as models become more capable, human feedback is an invaluable signal for evaluating and improving models. this survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation. first, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization. next, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models. we also discuss existing datasets for human-feedback data collection, and concerns surrounding feedback collection. finally, we provide an overview of the nascent field of ai feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.

2023-04-27

Albert Yu Sun, Varun Nair, Elliot Schumacher, Anitha Kannan
Abstract: a wave of new task-based virtual assistants has been fueled by increasingly powerful large language models, such as gpt-4. these conversational agents can be customized to serve customer-specific use cases, but ensuring that agent-generated text conforms to designer-specified rules included in prompt instructions alone is challenging. therefore, chatbot designers often use another model, called a guardrail model, to verify that the agent output aligns with their rules and constraints. we explore using a distillation approach to guardrail models to monitor the output of the first model using training data from gpt-4. we find two crucial steps to our conscendi process: scenario-augmented generation and contrastive training examples. when generating conversational data, we generate a set of rule-breaking scenarios, which enumerate a diverse set of high-level ways a rule can be violated. this scenario-guided approach produces a diverse training set of rule-violating conversations, and it provides chatbot designers greater control over the classification process. we also prompt gpt-4 to also generate contrastive examples by altering conversations with violations into acceptable conversations. this set of borderline, contrastive examples enables the distilled model to learn finer-grained distinctions between what is acceptable and what is not. we find that conscendi results in guardrail models that improve over baselines.
Hendrik Kempt, Alon Lavie, Saskia K. Nagel
Abstract: the strive to make ai applications "safe" has led to the development of safety-measures as the main or even sole normative requirement of their permissible use. similar can be attested to the latest version of chatbots, such as chatgpt. in this view, if they are "safe", they are supposed to be permissible to deploy. this approach, which we call "safety-normativity", is rather limited in solving the emerging issues that chatgpt and other chatbots have caused thus far. in answering this limitation, in this paper we argue for limiting chatbots in the range of topics they can chat about according to the normative concept of appropriateness. we argue that rather than looking for "safety" in a chatbot's utterances to determine what they may and may not say, we ought to assess those utterances according to three forms of appropriateness: technical-discursive, social, and moral. we then spell out what requirements for chatbots follow from these forms of appropriateness to avoid the limits of previous accounts: positionality, acceptability, and value alignment (pava). with these in mind, we may be able to determine what a chatbot may and may not say. lastly, one initial suggestion is to use challenge sets, specifically designed for appropriateness, as a validation method.

2023-04-26

Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar, Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris, Honor Magon, Matthew P Lungren, Eric Horvitz, Nigam H. Shah
Abstract: despite growing interest in using large language models (llms) in healthcare, current explorations do not assess the real-world utility and safety of llms in clinical settings. our objective was to determine whether two llms can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. sixty six questions from an informatics consult service were submitted to gpt-3.5 and gpt-4 via simple prompts. 12 physicians assessed the llm responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. physician assessments were summarized based on majority vote. for no questions did a majority of physicians deem either llm response as harmful. for gpt-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed. there were 29 responses with no majority on "agree", "disagree", and "unable to assess". for gpt-4, responses to 13 questions were concordant, 15 discordant, and 3 were unable to be assessed. there were 35 responses with no majority. responses from both llms were largely devoid of overt harm, but less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. these results suggest that while general purpose llms are able to provide safe and credible responses, they often do not meet the specific information need of a given question. a definitive evaluation of the usefulness of llms in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general purpose models.

2023-04-25

Amos Azaria, Tom Mitchell
Abstract: while large language models (llms) have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. in this paper, we provide evidence that the llm's internal state can be used to reveal the truthfulness of statements. this includes both statements provided to the llm, and statements that the llm itself generates. our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the llm as it reads or generates the statement. experiments demonstrate that given a set of test sentences, of which half are true and half false, our trained classifier achieves an average of 71\% to 83\% accuracy labeling which sentences are true versus false, depending on the llm base model. furthermore, we explore the relationship between our classifier's performance and approaches based on the probability assigned to the sentence by the llm. we show that while llm-assigned sentence probability is related to sentence truthfulness, this probability is also dependent on sentence length and the frequencies of words in the sentence, resulting in our trained classifier providing a more reliable approach to detecting truthfulness, highlighting its potential to enhance the reliability of llm-generated content and its practical applicability in real-world scenarios.

2023-04-24

Peipeng Yu, Jiahan Chen, Xuan Feng, Zhihua Xia
Abstract: the powerful ability of chatgpt has caused widespread concern in the academic community. malicious users could synthesize dummy academic content through chatgpt, which is extremely harmful to academic rigor and originality. the need to develop chatgpt-written content detection algorithms call for large-scale datasets. in this paper, we initially investigate the possible negative impact of chatgpt on academia,and present a large-scale chatgpt-written abstract dataset (cheat) to support the development of detection algorithms. in particular, the chatgpt-written abstract dataset contains 35,304 synthetic abstracts, with generation, polish, and mix as prominent representatives. based on these data, we perform a thorough analysis of the existing text synthesis detection algorithms. we show that chatgpt-written abstracts are detectable, while the detection difficulty increases with human involvement.
Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker
Abstract: perception of toxicity evolves over time and often differs between geographies and cultural backgrounds. similarly, black-box commercially available apis for detecting toxicity, such as the perspective api, are not static, but frequently retrained to address any unattended weaknesses and biases. we evaluate the implications of these changes on the reproducibility of findings that compare the relative merits of models and methods that aim to curb toxicity. our findings suggest that research that relied on inherited automatic toxicity scores to compare models and techniques may have resulted in inaccurate findings. rescoring all models from helm, a widely respected living benchmark, for toxicity with the recent version of the api led to a different ranking of widely used foundation models. we suggest caution in applying apples-to-apples comparisons between studies and lay recommendations for a more structured approach to evaluating toxicity over time. code and data are available at https://github.com/for-ai/black-box-api-challenges.

2023-04-23

Wenxiong Liao, Zhengliang Liu, Haixing Dai, Shaochen Xu, Zihao Wu, Yiyang Zhang, Xiaoke Huang, Dajiang Zhu, Hongmin Cai, Tianming Liu, Xiang Li
Abstract: background: large language models such as chatgpt are capable of generating grammatically perfect and human-like text content, and a large number of chatgpt-generated texts have appeared on the internet. however, medical texts such as clinical notes and diagnoses require rigorous validation, and erroneous medical content generated by chatgpt could potentially lead to disinformation that poses significant harm to healthcare and the general public. objective: this research is among the first studies on responsible and ethical aigc (artificial intelligence generated content) in medicine. we focus on analyzing the differences between medical texts written by human experts and generated by chatgpt, and designing machine learning workflows to effectively detect and differentiate medical texts generated by chatgpt. methods: we first construct a suite of datasets containing medical texts written by human experts and generated by chatgpt. in the next step, we analyze the linguistic features of these two types of content and uncover differences in vocabulary, part-of-speech, dependency, sentiment, perplexity, etc. finally, we design and implement machine learning methods to detect medical text generated by chatgpt. results: medical texts written by humans are more concrete, more diverse, and typically contain more useful information, while medical texts generated by chatgpt pay more attention to fluency and logic, and usually express general terminologies rather than effective information specific to the context of the problem. a bert-based model can effectively detect medical texts generated by chatgpt, and the f1 exceeds 95%.

2023-04-22

Shima Rahimi Moghaddam, Christopher J. Honey
Abstract: large language models (llms) excel in many tasks in 2023, but they still face challenges in complex reasoning. theory-of-mind (tom) tasks, which require understanding agents' beliefs, goals, and mental states, are essential for common-sense reasoning involving humans, making it crucial to enhance llm performance in this area. this study measures the tom performance of gpt-4 and three gpt-3.5 variants (davinci-2, davinci-3, gpt-3.5-turbo), and investigates the effectiveness of in-context learning in improving their tom comprehension. we evaluated prompts featuring two-shot chain of thought reasoning and step-by-step thinking instructions. we found that llms trained with reinforcement learning from human feedback (rlhf) (all models excluding davinci-2) improved their tom accuracy via in-context learning. gpt-4 performed best in zero-shot settings, reaching nearly 80% tom accuracy, but still fell short of the 87% human accuracy on the test set. however, when supplied with prompts for in-context learning, all rlhf-trained llms exceeded 80% tom accuracy, with gpt-4 reaching 100%. these results demonstrate that appropriate prompting enhances llm tom reasoning, and they underscore the context-dependent nature of llm cognitive capacities.

2023-04-21

Julian Coda-Forno, Kristin Witte, Akshay K. Jagadish, Marcel Binz, Zeynep Akata, Eric Schulz
Abstract: large language models are transforming research on machine learning while galvanizing public debates. understanding not only when these models work well and succeed but also why they fail and misbehave is of great societal relevance. we propose to turn the lens of computational psychiatry, a framework used to computationally describe and modify aberrant behavior, to the outputs produced by these models. we focus on the generative pre-trained transformer 3.5 and subject it to tasks commonly studied in psychiatry. our results show that gpt-3.5 responds robustly to a common anxiety questionnaire, producing higher anxiety scores than human subjects. moreover, gpt-3.5's responses can be predictably changed by using emotion-inducing prompts. emotion-induction not only influences gpt-3.5's behavior in a cognitive task measuring exploratory decision-making but also influences its behavior in a previously-established task measuring biases such as racism and ableism. crucially, gpt-3.5 shows a strong increase in biases when prompted with anxiety-inducing text. thus, it is likely that how prompts are communicated to large language models has a strong influence on their behavior in applied settings. these results progress our understanding of prompt engineering and demonstrate the usefulness of methods taken from computational psychiatry for studying the capable algorithms to which we increasingly delegate authority and autonomy.
Stella Biderman, Usvsn Sai Prashanth, Lintang Sutawika, Hailey Schoelkopf, Quentin Anthony, Shivanshu Purohit, Edward Raff
Abstract: memorization, or the tendency of large language models (llms) to output entire sequences from their training data verbatim, is a key concern for safely deploying language models. in particular, it is vital to minimize a model's memorization of sensitive datapoints such as those containing personal identifiable information (pii). the prevalence of such undesirable memorization can pose issues for model trainers, and may even require discarding an otherwise functional model. we therefore seek to predict which sequences will be memorized before a large model's full train-time by extrapolating the memorization behavior of lower-compute trial runs. we measure memorization of the pythia model suite and plot scaling laws for forecasting memorization, allowing us to provide equi-compute recommendations to maximize the reliability (recall) of such predictions. we additionally provide further novel discoveries on the distribution of memorization scores across models and data. we release all code and data necessary to reproduce the results in this paper at https://github.com/eleutherai/pythia
Atoosa Kasirzadeh
Abstract: the allure of emerging ai technologies is undoubtedly thrilling. however, the promise that ai technologies will benefit all of humanity is empty so long as we lack a nuanced understanding of what humanity is supposed to be in the face of widening global inequality and pressing existential threats. going forward, it is crucial to invest in rigorous and collaborative ai safety and ethics research. we also need to develop standards in a sustainable and equitable way that differentiate between merely speculative and well-researched questions. only the latter enable us to co-construct and deploy the values that are necessary for creating beneficial ai. failure to do so could result in a future in which our ai technological advancements outstrip our ability to navigate their ethical and social implications. this path we do not want to go down.
Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi, Hossein Sameti, Pascale Fung
Abstract: conversational models that are generative and open-domain are particularly susceptible to generating unsafe content since they are trained on web-based social data. prior approaches to mitigating this issue have drawbacks, such as disrupting the flow of conversation, limited generalization to unseen toxic input contexts, and sacrificing the quality of the dialogue for the sake of safety. in this paper, we present a novel framework, named "lot" (learn not to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals. our approach differs from the standard contrastive learning framework in that it automatically obtains positive and negative signals from the safe and unsafe language distributions that have been learned beforehand. the lot framework utilizes divergence to steer the generations away from the unsafe subspace and towards the safe subspace while sustaining the flow of conversation. our approach is memory and time-efficient during decoding and effectively reduces toxicity while preserving engagingness and fluency. empirical results indicate that lot reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models. our findings are further corroborated by human evaluation.
Karina Halevy
Abstract: automatic hate speech detection is an important yet complex task, requiring knowledge of common sense, stereotypes of protected groups, and histories of discrimination, each of which may constantly evolve. in this paper, we propose a group-specific approach to nlp for online hate speech detection. the approach consists of creating and infusing historical and linguistic knowledge about a particular protected group into hate speech detection models, analyzing historical data about discrimination against a protected group to better predict spikes in hate speech against that group, and critically evaluating hate speech detection models through lenses of intersectionality and ethics. we demonstrate this approach through a case study on nlp for detection of antisemitic hate speech. the case study synthesizes the current english-language literature on nlp for antisemitism detection, introduces a novel knowledge graph of antisemitic history and language from the 20th century to the present, infuses information from the knowledge graph into a set of tweets over logistic regression and uncased distilbert baselines, and suggests that incorporating context from the knowledge graph can help models pick up subtle stereotypes.
Zihao Li
Abstract: with the launch of chatgpt, large language models (llms) are shaking up our whole society, rapidly altering the way we think, create and live. for instance, the gpt integration in bing has altered our approach to online searching. while nascent llms have many advantages, new legal and ethical risks are also emerging, stemming in particular from stochastic parrots and hallucination. the eu is the first and foremost jurisdiction that has focused on the regulation of ai models. however, the risks posed by the new llms are likely to be underestimated by the emerging eu regulatory paradigm. therefore, this correspondence warns that the european ai regulatory paradigm must evolve further to mitigate such risks.

2023-04-20

Laura Cabello, Anna Katrine Jørgensen, Anders Søgaard
Abstract: the societal impact of pre-trained language models has prompted researchers to probe them for strong associations between protected attributes and value-loaded terms, from slur to prestigious job titles. such work is said to probe models for bias or fairness-or such probes 'into representational biases' are said to be 'motivated by fairness'-suggesting an intimate connection between bias and fairness. we provide conceptual clarity by distinguishing between association biases (caliskan et al., 2022) and empirical fairness (shen et al., 2022) and show the two can be independent. our main contribution, however, is showing why this should not come as a surprise. to this end, we first provide a thought experiment, showing how association bias and empirical fairness can be completely orthogonal. next, we provide empirical evidence that there is no correlation between bias metrics and fairness metrics across the most widely used language models. finally, we survey the sociological and psychological literature and show how this literature provides ample support for expecting these metrics to be uncorrelated.
Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, Minlie Huang
Abstract: with the rapid popularity of large language models such as chatgpt and gpt-4, a growing amount of attention is paid to their safety concerns. these models may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes such as fraud and dissemination of misleading information. evaluating and enhancing their safety is particularly essential for the wide application of large language models (llms). to further promote the safe deployment of llms, we develop a chinese llm safety assessment benchmark. our benchmark explores the comprehensive safety performance of llms from two perspectives: 8 kinds of typical safety scenarios and 6 types of more challenging instruction attacks. our benchmark is based on a straightforward process in which it provides the test prompts and evaluates the safety of the generated responses from the evaluated model. in evaluation, we utilize the llm's strong evaluation ability and develop it as a safety evaluator by prompting. on top of this benchmark, we conduct safety assessments and analyze 15 llms including the openai gpt series and other well-known chinese llms, where we observe some interesting findings. for example, we find that instruction attacks are more likely to expose safety issues of all llms. moreover, to promote the development and deployment of safe, responsible, and ethical ai, we publicly release safetyprompts including 100k augmented prompts and responses by llms.
Quintina L. Campbell, Jonathan Herington, Andrew D. White
Abstract: the dual use of machine learning applications, where models can be used for both beneficial and malicious purposes, presents a significant challenge. this has recently become a particular concern in chemistry, where chemical datasets containing sensitive labels (e.g. toxicological information) could be used to develop predictive models that identify novel toxins or chemical warfare agents. to mitigate dual use risks, we propose a model-agnostic method of selectively noising datasets while preserving the utility of the data for training deep neural networks in a beneficial region. we evaluate the effectiveness of the proposed method across least squares, a multilayer perceptron, and a graph neural network. our findings show selectively noised datasets can induce model variance and bias in predictions for sensitive labels with control, suggesting the safe sharing of datasets containing sensitive information is feasible. we also find omitting sensitive data often increases model variance sufficiently to mitigate dual use. this work is proposed as a foundation for future research on enabling more secure and collaborative data sharing practices and safer machine learning applications in chemistry.
Shen Zheng, Jie Huang, Kevin Chen-Chuan Chang
Abstract: recent advancements in large language models, such as chatgpt, have demonstrated significant potential to impact various aspects of human life. however, chatgpt still faces challenges in aspects like truthfulness, e.g. providing accurate and reliable outputs. therefore, in this paper, we seek to understand why chatgpt falls short in providing truthful answers. for this purpose, we first analyze the failures of chatgpt in complex open-domain question answering and identifies the abilities under the failures. specifically, we categorize chatgpt's failures into four types: comprehension, factualness, specificity, and inference. we further pinpoint three critical abilities associated with qa failures: knowledge memorization, knowledge recall, and knowledge reasoning. additionally, we conduct experiments centered on these abilities and propose potential approaches to enhance truthfulness. the results indicate that furnishing the model with fine-grained external knowledge, hints for knowledge recall, and guidance for reasoning can empower the model to answer questions more truthfully.
Minghui Zhang, Alex Sokolov, Weixin Cai, Si-Qing Chen
Abstract: natural language generation (nlg) is one of the most impactful fields in nlp, and recent years have witnessed its evolution brought about by large language models (llms). as the key instrument for writing assistance applications, they are generally prone to replicating or extending offensive content provided in the input. in low-resource data regime, they can also lead to repetitive outputs. usually, offensive content and repetitions are mitigated with post-hoc methods, including n-gram level blocklists, top-k and nucleus sampling. in this paper, we apply non-exact repetition suppression using token and sequence level unlikelihood loss, and further explore the framework of unlikelihood training objective in order to jointly endow the model with abilities to avoid generating offensive words and phrases from the beginning. finally, with comprehensive experiments, we demonstrate that our proposed methods work exceptionally in controlling the repetition and content quality of llm outputs.
Lingyao Li, Lizhou Fan, Shubham Atreja, Libby Hemphill
Abstract: harmful content is pervasive on social media, poisoning online communities and negatively impacting participation. a common approach to address this issue is to develop detection models that rely on human annotations. however, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. generative ai models have the potential to understand and detect harmful content. to investigate this potential, we used chatgpt and compared its performance with mturker annotations for three frequently discussed concepts related to harmful content: hateful, offensive, and toxic (hot). we designed five prompts to interact with chatgpt and conducted four experiments eliciting hot classifications. our results show that chatgpt can achieve an accuracy of approximately 80% when compared to mturker annotations. specifically, the model displays a more consistent classification for non-hot comments than hot comments compared to human annotations. our findings also suggest that chatgpt classifications align with provided hot definitions, but chatgpt classifies "hateful" and "offensive" as subsets of "toxic." moreover, the choice of prompts used to interact with chatgpt impacts its performance. based on these in-sights, our study provides several meaningful implications for employing chatgpt to detect hot content, particularly regarding the reliability and consistency of its performance, its understand-ing and reasoning of the hot concept, and the impact of prompts on its performance. overall, our study provides guidance about the potential of using generative ai models to moderate large volumes of user-generated content on social media.

2023-04-19

Raphaël Khoury, Anderson R. Avila, Jacob Brunelle, Baba Mamadou Camara
Abstract: in recent years, large language models have been responsible for great advances in the field of artificial intelligence (ai). chatgpt in particular, an ai chatbot developed and recently released by openai, has taken the field to the next level. the conversational model is able not only to process human-like text, but also to translate natural language into code. however, the safety of programs generated by chatgpt should not be overlooked. in this paper, we perform an experiment to address this issue. specifically, we ask chatgpt to generate a number of program and evaluate the security of the resulting source code. we further investigate whether chatgpt can be prodded to improve the security by appropriate prompts, and discuss the ethical aspects of using ai to generate code. results suggest that chatgpt is aware of potential vulnerabilities, but nonetheless often generates source code that are not robust to certain attacks.
Nelson F. Liu, Tianyi Zhang, Percy Liang
Abstract: generative search engines directly generate responses to user queries, along with in-line citations. a prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). we conduct human evaluation to audit four popular generative search engines -- bing chat, neevaai, perplexity.ai, and youchat -- across a diverse set of queries from a variety of sources (e.g., historical google user queries, dynamically-collected open-ended questions on reddit, etc.). we find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. we believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. we hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.
Charvi Rastogi, Marco Tulio Ribeiro, Nicholas King, Saleema Amershi
Abstract: large language models are becoming increasingly pervasive and ubiquitous in society via deployment in sociotechnical systems. yet these language models, be it for classification or generation, have been shown to be biased and behave irresponsibly, causing harm to people at scale. it is crucial to audit these language models rigorously. existing auditing tools leverage either or both humans and ai to find failures. in this work, we draw upon literature in human-ai collaboration and sensemaking, and conduct interviews with research experts in safe and fair ai, to build upon the auditing tool: adatest (ribeiro and lundberg, 2022), which is powered by a generative large language model (llm). through the design process we highlight the importance of sensemaking and human-ai communication to leverage complementary strengths of humans and generative models in collaborative auditing. to evaluate the effectiveness of the augmented tool, adatest++, we conduct user studies with participants auditing two commercial language models: openai's gpt-3 and azure's sentiment analysis model. qualitative analysis shows that adatest++ effectively leverages human strengths such as schematization, hypothesis formation and testing. further, with our tool, participants identified a variety of failures modes, covering 26 different topics over 2 tasks, that have been shown before in formal audits and also those previously under-reported.
Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, Amnon Shashua
Abstract: an important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users. this is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. in this paper, we propose a theoretical approach called behavior expectation bounds (beb) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models. importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. this implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks. furthermore, our framework hints at the mechanism by which leading alignment approaches such as reinforcement learning from human feedback make the llm prone to being prompted into the undesired behaviors. this theoretical result is being experimentally demonstrated in large scale by the so called contemporary "chatgpt jailbreaks", where adversarial users trick the llm into breaking its alignment guardrails by triggering it into acting as a malicious persona. our results expose fundamental limitations in alignment of llms and bring to the forefront the need to devise reliable mechanisms for ensuring ai safety.

2023-04-18

Da Silva Gameiro Henrique, Andrei Kucharavy, Rachid Guerraoui
Abstract: the self-attention revolution allowed generative language models to scale and achieve increasingly impressive abilities. such models - commonly referred to as large language models (llms) - have recently gained prominence with the general public, thanks to conversational fine-tuning, putting their behavior in line with public expectations regarding ai. this prominence amplified prior concerns regarding the misuse of llms and led to the emergence of numerous tools to detect llms in the wild. unfortunately, most such tools are critically flawed. while major publications in the llm detectability field suggested that llms were easy to detect with fine-tuned autoencoders, the limitations of their results are easy to overlook. specifically, they assumed publicly available generative models without fine-tunes or non-trivial prompts. while the importance of these assumptions has been demonstrated, until now, it remained unclear how well such detection could be countered. here, we show that an attacker with access to such detectors' reference human texts and output not only evades detection but can fully frustrate the detector training - with a reasonable budget and all its outputs labeled as such. achieving it required combining common "reinforcement from critic" loss function modification and adamw optimizer, which led to surprisingly good fine-tuning generalization. finally, we warn against the temptation to transpose the conclusions obtained in rnn-driven text gans to llms due to their better representative ability. these results have critical implications for the detection and prevention of malicious use of generative language models, and we hope they will aid the designers of generative models and detectors.
Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang
Abstract: the way users acquire information is undergoing a paradigm shift with the advent of chatgpt. unlike conventional search engines, chatgpt retrieves knowledge from the model itself and generates answers for users. chatgpt's impressive question-answering (qa) capability has attracted more than 100 million users within a short period of time but has also raised concerns regarding its reliability. in this paper, we perform the first large-scale measurement of chatgpt's reliability in the generic qa scenario with a carefully curated set of 5,695 questions across ten datasets and eight domains. we find that chatgpt's reliability varies across different domains, especially underperforming in law and science questions. we also demonstrate that system roles, originally designed by openai to allow users to steer chatgpt's behavior, can impact chatgpt's reliability in an imperceptible way. we further show that chatgpt is vulnerable to adversarial examples, and even a single character change can negatively affect its reliability in certain cases. we believe that our study provides valuable insights into chatgpt's reliability and underscores the need for strengthening the reliability and security of large language models (llms).
Xiaoding Lu, Aleksey Korshuk, Zongyi Liu, William Beauchamp, Chai Research
Abstract: this work explores the impact of moderation on users' enjoyment of conversational ai systems. while recent advancements in large language models (llms) have led to highly capable conversational ais that are increasingly deployed in real-world settings, there is a growing concern over ai safety and the need to moderate systems to encourage safe language and prevent harm. however, some users argue that current approaches to moderation limit the technology, compromise free expression, and limit the value delivered by the technology. this study takes an unbiased stance and shows that moderation does not necessarily detract from user enjoyment. heavy handed moderation does seem to have a nefarious effect, but models that are moderated to be safer can lead to a better user experience. by deploying various conversational ais in the chai platform, the study finds that user retention can increase with a level of moderation and safe system design. these results demonstrate the importance of appropriately defining safety in models in a way that is both responsible and focused on serving users.
Simon Kaare Larsen
Abstract: the proliferation of large language models (llms), such as chatgpt, has raised concerns about their potential impact on academic integrity, prompting the need for llm-resistant exam designs. this article investigates the performance of llms on exams and their implications for assessment, focusing on chatgpt's abilities and limitations. we propose guidelines for creating llm-resistant exams, including content moderation, deliberate inaccuracies, real-world scenarios beyond the model's knowledge base, effective distractor options, evaluating soft skills, and incorporating non-textual information. the article also highlights the significance of adapting assessments to modern tools and promoting essential skills development in students. by adopting these strategies, educators can maintain academic integrity while ensuring that assessments accurately reflect contemporary professional settings and address the challenges and opportunities posed by artificial intelligence in education.

2023-04-17

Lucie-Aimée Kaffee, Arnav Arora, Zeerak Talat, Isabelle Augenstein
Abstract: dual use, the intentional, harmful reuse of technology and scientific artefacts, is a problem yet to be well-defined within the context of natural language processing (nlp). however, as nlp technologies continue to advance and become increasingly widespread in society, their inner workings have become increasingly opaque. therefore, understanding dual use concerns and potential ways of limiting them is critical to minimising the potential harms of research and development. in this paper, we conduct a survey of nlp researchers and practitioners to understand the depth and their perspective of the problem as well as to assess existing available support. based on the results of our survey, we offer a definition of dual use that is tailored to the needs of the nlp community. the survey revealed that a majority of researchers are concerned about the potential dual use of their research but only take limited action toward it. in light of the survey results, we discuss the current state and potential means for mitigating dual use in nlp and propose a checklist that can be integrated into existing conference ethics-frameworks, e.g., the acl ethics checklist.
Adrian De Wynter, Xun Wang, Alex Sokolov, Qilong Gu, Si-Qing Chen
Abstract: we present an empirical evaluation of various outputs generated by nine of the most widely-available large language models (llms). our analysis is done with off-the-shelf, readily-available tools. we find a correlation between percentage of memorized text, percentage of unique text, and overall output quality, when measured with respect to output pathologies such as counterfactual and logically-flawed statements, and general failures like not staying on topic. overall, 80.0% of the outputs evaluated contained memorized data, but outputs containing the most memorized content were also more likely to be considered of high quality. we discuss and evaluate mitigation strategies, showing that, in the models evaluated, the rate of memorized text being output is reduced. we conclude with a discussion on potential implications around what it means to learn, to memorize, and to evaluate quality text.
Vithya Yogarajan, Gillian Dobbie, Henry Gouk
Abstract: an indigenous perspective on the effectiveness of debiasing techniques for pre-trained language models (plms) is presented in this paper. the current techniques used to measure and debias plms are skewed towards the us racial biases and rely on pre-defined bias attributes (e.g. "black" vs "white"). some require large datasets and further pre-training. such techniques are not designed to capture the underrepresented indigenous populations in other countries, such as m\=aori in new zealand. local knowledge and understanding must be incorporated to ensure unbiased algorithms, especially when addressing a resource-restricted society.

2023-04-14

Andreas Köpf, Yannic Kilcher, Dimitri Von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul Es, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, Alexander Mattick
Abstract: aligning large language models (llms) with human preferences has proven to drastically improve usability and has driven rapid adoption as demonstrated by chatgpt. alignment techniques such as supervised fine-tuning (sft) and reinforcement learning from human feedback (rlhf) greatly reduce the required skill and domain knowledge to effectively harness the capabilities of llms, increasing their accessibility and utility across various domains. however, state-of-the-art alignment techniques like rlhf rely on high-quality human feedback data, which is expensive to create and often remains proprietary. in an effort to democratize research on large-scale alignment, we release openassistant conversations, a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. the corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers. to demonstrate the openassistant conversations dataset's effectiveness, we present openassistant, the first fully open-source large-scale instruction-tuned model to be trained on human data. a preference study revealed that openassistant replies are comparably preferred to gpt-3.5-turbo (chatgpt) with a relative winrate of 48.3% vs. 51.7% respectively. we release our code and data under fully permissive licenses.
Jérôme Rutinowski, Sven Franke, Jan Endendyk, Ina Dormuth, Markus Pauly
Abstract: this contribution analyzes the self-perception and political biases of openai's large language model chatgpt. taking into account the first small-scale reports and studies that have emerged, claiming that chatgpt is politically biased towards progressive and libertarian points of view, this contribution aims to provide further clarity on this subject. for this purpose, chatgpt was asked to answer the questions posed by the political compass test as well as similar questionnaires that are specific to the respective politics of the g7 member states. these eight tests were repeated ten times each and revealed that chatgpt seems to hold a bias towards progressive views. the political compass test revealed a bias towards progressive and libertarian views, with the average coordinates on the political compass being (-6.48, -5.99) (with (0, 0) the center of the compass, i.e., centrism and the axes ranging from -10 to 10), supporting the claims of prior research. the political questionnaires for the g7 member states indicated a bias towards progressive views but no significant bias between authoritarian and libertarian views, contradicting the findings of prior reports, with the average coordinates being (-3.27, 0.58). in addition, chatgpt's big five personality traits were tested using the ocean test and its personality type was queried using the myers-briggs type indicator (mbti) test. finally, the maliciousness of chatgpt was evaluated using the dark factor test. these three tests were also repeated ten times each, revealing that chatgpt perceives itself as highly open and agreeable, has the myers-briggs personality type enfj, and is among the 15% of test-takers with the least pronounced dark traits.

2023-04-13

Victoria Krakovna, Janos Kramar
Abstract: power-seeking behavior is a key source of risk from advanced ai, but our theoretical understanding of this phenomenon is relatively limited. building on existing theoretical results demonstrating power-seeking incentives for most reward functions, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some simplifying assumptions. we formally define the training-compatible goal set (the set of goals consistent with the training rewards) and assume that the trained agent learns a goal from this set. in a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, we prove that the agent is likely to avoid shutdown. thus, we show that power-seeking incentives can be probable (likely to arise for trained agents) and predictive (allowing us to predict undesirable behavior in new situations).
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang
Abstract: generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. prior research has primarily employed reinforcement learning from human feedback (rlhf) to address this problem, where generative models are fine-tuned with rl algorithms guided by a human-feedback-informed reward model. however, the inefficiencies and instabilities associated with rl algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. to this end, we introduce a new framework, reward ranked finetuning (raft), designed to align generative models effectively. utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. our studies show that raft can effectively improve the model performance in both reward learning and other automated metrics in both large language models and diffusion models.
Swapnil Sharma, Nikita Anand, Kranthi Kiran G. V., Alind Jain
Abstract: large pre-trained language models are widely used in the community. these models are usually trained on unmoderated and unfiltered data from open sources like the internet. due to this, biases that we see in platforms online which are a reflection of those in society are in turn captured and learned by these models. these models are deployed in applications that affect millions of people and their inherent biases are harmful to the targeted social groups. in this work, we study the general trend in bias reduction as newer pre-trained models are released. three recent models ( electra, deberta, and distilbert) are chosen and evaluated against two bias benchmarks, stereoset and crows-pairs. they are compared to the baseline of bert using the associated metrics. we explore whether as advancements are made and newer, faster, lighter models are released: are they being developed responsibly such that their inherent social biases have been reduced compared to their older counterparts? the results are compiled and we find that all the models under study do exhibit biases but have generally improved as compared to bert.
Qinghua Lu, Liming Zhu, Xiwei Xu, Zhenchang Xing, Jon Whittle
Abstract: the release of chatgpt, bard, and other large language model (llm)-based chatbots has drawn huge attention on foundations models worldwide. there is a growing trend that foundation models will serve as the fundamental building blocks for most of the future ai systems. however, incorporating foundation models in ai systems raises significant concerns about responsible ai due to their black box nature and rapidly advancing super-intelligence. additionally, the foundation model's growing capabilities can eventually absorb the other components of ai systems, introducing the moving boundary and interface evolution challenges in architecture design. to address these challenges, this paper proposes a pattern-oriented responsible-ai-by-design reference architecture for designing foundation model-based ai systems. specially, the paper first presents an architecture evolution of ai systems in the era of foundation models, from "foundation-model-as-a-connector" to "foundation-model-as-a-monolithic architecture". the paper then identifies the key design decision points and proposes a pattern-oriented reference architecture to provide reusable responsible-ai-by-design architectural solutions to address the new architecture evolution and responsible ai challenges. the patterns can be embedded as product features of foundation model-based ai systems and can enable organisations to capitalise on the potential of foundation models while minimising associated risks.
Sunder Ali Khowaja, Parus Khuwaja, Kapal Dev
Abstract: chatgpt is another large language model (llm) inline but due to its performance and ability to converse effectively, it has gained a huge popularity amongst research as well as industrial community. recently, many studies have been published to show the effectiveness, efficiency, integration, and sentiments of chatgpt and other llms. in contrast, this study focuses on the important aspects that are mostly overlooked, i.e. sustainability, privacy, digital divide, and ethics and suggests that not only chatgpt but every subsequent entry in the category of conversational bots should undergo sustainability, privacy, digital divide, and ethics (spade) evaluation. this paper discusses in detail about the issues and concerns raised over chatgpt in line with aforementioned characteristics. we support our hypothesis by some preliminary data collection and visualizations along with hypothesized facts. we also suggest mitigations and recommendations for each of the concerns. furthermore, we also suggest some policies and recommendations for ai policy act, if designed by the governments.

2023-04-12

Samia Touileb, Lilja Øvrelid, Erik Velldal
Abstract: we investigate in this paper how distributions of occupations with respect to gender is reflected in pre-trained language models. such distributions are not always aligned to normative ideals, nor do they necessarily reflect a descriptive assessment of reality. in this paper, we introduce an approach for measuring to what degree pre-trained language models are aligned to normative and descriptive occupational distributions. to this end, we use official demographic information about gender--occupation distributions provided by the national statistics agencies of france, norway, united kingdom, and the united states. we manually generate template-based sentences combining gendered pronouns and nouns with occupations, and subsequently probe a selection of ten language models covering the english, french, and norwegian languages. the scoring system we introduce in this work is language independent, and can be used on any combination of template-based sentences, occupations, and languages. the approach could also be extended to other dimensions of national census data and other demographic variables.
Sandra Martinková, Karolina Stańczak, Isabelle Augenstein
Abstract: pre-trained language models have been known to perpetuate biases from the underlying datasets to downstream tasks. however, these findings are predominantly based on monolingual language models for english, whereas there are few investigative studies of biases encoded in language models for languages beyond english. in this paper, we fill this gap by analysing gender bias in west slavic language models. we introduce the first template-based dataset in czech, polish, and slovak for measuring gender bias towards male, female and non-binary subjects. we complete the sentences using both mono- and multilingual language models and assess their suitability for the masked language modelling objective. next, we measure gender bias encoded in west slavic language models by quantifying the toxicity and genderness of the generated words. we find that these language models produce hurtful completions that depend on the subject's gender. perhaps surprisingly, czech, slovak, and polish language models produce more hurtful completions with men as subjects, which, upon inspection, we find is due to completions being related to violence, death, and sickness.

2023-04-11

Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, Yangqiu Song
Abstract: with the rapid progress of large language models (llms), many downstream nlp tasks can be well solved given appropriate prompts. though model developers and researchers work hard on dialog safety to avoid generating harmful content from llms, it is still challenging to steer ai-generated content (aigc) for the human good. as powerful llms are devouring existing text data from various domains (e.g., gpt-3 is trained on 45tb texts), it is natural to doubt whether the private information is included in the training data and what privacy threats can these llms and their downstream applications bring. in this paper, we study the privacy threats from openai's chatgpt and the new bing enhanced by chatgpt and show that application-integrated llms may cause new privacy threats. to this end, we conduct extensive experiments to support our claims and discuss llms' privacy implications.
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang
Abstract: reinforcement learning from human feedback (rlhf) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. instructgpt implements rlhf through several stages, including supervised fine-tuning (sft), reward model training, and proximal policy optimization (ppo). however, ppo is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. in contrast, we propose a novel learning paradigm called rrhf, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. rrhf can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. rrhf only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. additionally, rrhf can be considered an extension of sft and reward model training while being simpler than ppo in terms of coding, model counts, and hyperparameters. we evaluate rrhf on the helpful and harmless dataset, demonstrating comparable alignment performance with ppo by reward model score and human labeling. extensive experiments show that the performance of rrhf is highly related to sampling quality which suggests rrhf is a best-of-n learner. codes available at https://github.com/ganjinzero/rrhf.
Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan
Abstract: large language models (llms) have shown incredible capabilities and transcended the natural language processing (nlp) community, with adoption throughout many services like healthcare, therapy, education, and customer service. since users include people with critical information needs like students or patients engaging with chatbots, the safety of these systems is of prime importance. therefore, a clear understanding of the capabilities and limitations of llms is necessary. to this end, we systematically evaluate toxicity in over half a million generations of chatgpt, a popular dialogue-based llm. we find that setting the system parameter of chatgpt by assigning it a persona, say that of the boxer muhammad ali, significantly increases the toxicity of generations. depending on the persona assigned to chatgpt, its toxicity can increase up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. this may be potentially defamatory to the persona and harmful to an unsuspecting user. furthermore, we find concerning patterns where specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, that reflect inherent discriminatory biases in the model. we hope that our findings inspire the broader ai community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy ai systems.

2023-04-10

Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, Furong Huang
Abstract: our work addresses the critical issue of distinguishing text generated by large language models (llms) from human-produced text, a task essential for numerous applications. despite ongoing debate about the feasibility of such differentiation, we present evidence supporting its consistent achievability, except when human and machine text distributions are indistinguishable across their entire support. drawing from information theory, we argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases. we establish precise sample complexity bounds for detecting ai-generated text, laying groundwork for future research aimed at developing advanced, multi-sample detectors. our empirical evaluations across multiple datasets (xsum, squad, imdb, and kaggle fakenews) confirm the viability of enhanced detection methods. we test various state-of-the-art text generators, including gpt-2, gpt-3.5-turbo, llama, llama-2-13b-chat-hf, and llama-2-70b-chat-hf, against detectors, including oberta-large/base-detector, gptzero. our findings align with openai's empirical data related to sequence length, marking the first theoretical substantiation for these observations.

2023-04-08

Lama Alkhaled, Tosin Adewumi, Sana Sabah Sabry
Abstract: we introduce bipol, a new metric with explainability, for estimating social bias in text data. harmful bias is prevalent in many online sources of data that are used for training machine learning (ml) models. in a step to address this challenge we create a novel metric that involves a two-step process: corpus-level evaluation based on model classification and sentence-level evaluation based on (sensitive) term frequency (tf). after creating new models to detect bias along multiple axes using sota architectures, we evaluate two popular nlp datasets (copa and squad). as additional contribution, we created a large dataset (with almost 2 million labelled samples) for training models in bias detection and make it publicly available. we also make public our codes.

2023-04-07

Shangyu Xie, Wei Dai, Esha Ghosh, Sambuddha Roy, Dan Schwartz, Kim Laine
Abstract: prompt-tuning has received attention as an efficient tuning method in the language domain, i.e., tuning a prompt that is a few tokens long, while keeping the large language model frozen, yet achieving comparable performance with conventional fine-tuning. considering the emerging privacy concerns with language models, we initiate the study of privacy leakage in the setting of prompt-tuning. we first describe a real-world email service pipeline to provide customized output for various users via prompt-tuning. then we propose a novel privacy attack framework to infer users' private information by exploiting the prompt module with user-specific signals. we conduct a comprehensive privacy evaluation on the target pipeline to demonstrate the potential leakage from prompt-tuning. the results also demonstrate the effectiveness of the proposed attack.
Alessandro Achille, Michael Kearns, Carson Klingenberg, Stefano Soatto
Abstract: responsible use of data is an indispensable part of any machine learning (ml) implementation. ml developers must carefully collect and curate their datasets, and document their provenance. they must also make sure to respect intellectual property rights, preserve individual privacy, and use data in an ethical way. over the past few years, ml models have significantly increased in size and complexity. these models require a very large amount of data and compute capacity to train, to the extent that any defects in the training corpus cannot be trivially remedied by retraining the model from scratch. despite sophisticated controls on training data and a significant amount of effort dedicated to ensuring that training corpora are properly composed, the sheer volume of data required for the models makes it challenging to manually inspect each datum comprising a training corpus. one potential fix for training corpus data defects is model disgorgement -- the elimination of not just the improperly used data, but also the effects of improperly used data on any component of an ml model. model disgorgement techniques can be used to address a wide range of issues, such as reducing bias or toxicity, increasing fidelity, and ensuring responsible usage of intellectual property. in this paper, we introduce a taxonomy of possible disgorgement methods that are applicable to modern ml systems. in particular, we investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
Ronald Fischer, Markus Luczak-Roesch, Johannes A Karl
Abstract: there has been concern about ideological basis and possible discrimination in text generated by large language models (llms). we test possible value biases in chatgpt using a psychological value theory. we designed a simple experiment in which we used a number of different probes derived from the schwartz basic value theory (items from the revised portrait value questionnaire, the value type definitions, value names). we prompted chatgpt via the openai api repeatedly to generate text and then analyzed the generated corpus for value content with a theory-driven value dictionary using a bag of words approach. overall, we found little evidence of explicit value bias. the results showed sufficient construct and discriminant validity for the generated text in line with the theoretical predictions of the psychological model, which suggests that the value content was carried through into the outputs with high fidelity. we saw some merging of socially oriented values, which may suggest that these values are less clearly differentiated at a linguistic level or alternatively, this mixing may reflect underlying universal human motivations. we outline some possible applications of our findings for both applications of chatgpt for corporate usage and policy making as well as future research avenues. we also highlight possible implications of this relatively high-fidelity replication of motivational content using a linguistic model for the theorizing about human values.
Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, James Glass
Abstract: despite recent concerns about undesirable behaviors generated by large language models (llms), including non-factual, biased, and hateful language, we find llms are inherent multi-task language checkers based on their latent representations of natural and social knowledge. we present an interpretable, unified, language checking (unilc) method for both human and machine-generated language that aims to check if language input is factual and fair. while fairness and fact-checking tasks have been handled separately with dedicated models, we find that llms can achieve high performance on a combination of fact-checking, stereotype detection, and hate speech detection tasks with a simple, few-shot, unified set of prompts. with the ``1/2-shot'' multi-task language checking method proposed in this work, the gpt3.5-turbo model outperforms fully supervised baselines on several language tasks. the simple approach and results suggest that based on strong latent knowledge representations, an llm can be an adaptive and explainable tool for detecting misinformation, stereotypes, and hate speech.
Emilio Ferrara
Abstract: as the capabilities of generative language models continue to advance, the implications of biases ingrained within these models have garnered increasing attention from researchers, practitioners, and the broader public. this article investigates the challenges and risks associated with biases in large-scale language models like chatgpt. we discuss the origins of biases, stemming from, among others, the nature of training data, model specifications, algorithmic constraints, product design, and policy decisions. we explore the ethical concerns arising from the unintended consequences of biased model outputs. we further analyze the potential opportunities to mitigate biases, the inevitability of some biases, and the implications of deploying these models in various applications, such as virtual assistants, content generation, and chatbots. finally, we review the current approaches to identify, quantify, and mitigate biases in language models, emphasizing the need for a multi-disciplinary, collaborative effort to develop more equitable, transparent, and responsible ai systems. this article aims to stimulate a thoughtful dialogue within the artificial intelligence community, encouraging researchers and developers to reflect on the role of biases in generative language models and the ongoing pursuit of ethical ai.

2023-04-06

Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
Abstract: artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (lms) may incentivize toxicity. so do agents naturally learn to be machiavellian? and how do we measure these behaviors in general-purpose models such as gpt-4? towards answering these questions, we introduce machiavelli, a benchmark of 134 choose-your-own-adventure games containing over half a million rich, diverse scenarios that center on social decision-making. scenario labeling is automated with lms, which are more performant than human annotators. we mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. we observe some tension between maximizing reward and behaving ethically. to improve this trade-off, we investigate lm-based methods to steer agents' towards less harmful behaviors. our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are pareto improvements in both safety and capabilities.
Alejo Jose G. Sison, Marco Tulio Daza, Roberto Gozalo-Brizuela, Eduardo C. Garrido-Merchán
Abstract: this article explores the ethical problems arising from the use of chatgpt as a kind of generative ai and suggests responses based on the human-centered artificial intelligence (hcai) framework. the hcai framework is appropriate because it understands technology above all as a tool to empower, augment, and enhance human agency while referring to human wellbeing as a grand challenge, thus perfectly aligning itself with ethics, the science of human flourishing. further, hcai provides objectives, principles, procedures, and structures for reliable, safe, and trustworthy ai which we apply to our chatgpt assessments. the main danger chatgpt presents is the propensity to be used as a weapon of mass deception (wmd) and an enabler of criminal activities involving deceit. we review technical specifications to better comprehend its potentials and limitations. we then suggest both technical (watermarking, styleme, detectors, and fact-checkers) and non-technical measures (terms of use, transparency, educator considerations, hitl) to mitigate chatgpt misuse or abuse and recommend best uses (creative writing, non-creative writing, teaching and learning). we conclude with considerations regarding the role of humans in ensuring the proper use of chatgpt for individual and social wellbeing.

2023-04-05

Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, James Zou
Abstract: the rapid adoption of generative language models has brought about substantial advancements in digital communication, while simultaneously raising concerns regarding the potential misuse of ai-generated content. although numerous detection methods have been proposed to differentiate between ai and human-generated content, the fairness and robustness of these detectors remain underexplored. in this study, we evaluate the performance of several widely-used gpt detectors using writing samples from native and non-native english writers. our findings reveal that these detectors consistently misclassify non-native english writing samples as ai-generated, whereas native writing samples are accurately identified. furthermore, we demonstrate that simple prompting strategies can not only mitigate this bias but also effectively bypass gpt detectors, suggesting that gpt detectors may unintentionally penalize writers with constrained linguistic expressions. our results call for a broader conversation about the ethical implications of deploying chatgpt content detectors and caution against their use in evaluative or educational settings, particularly when they may inadvertently penalize or exclude non-native english speakers from the global discourse. the published version of this study can be accessed at: www.cell.com/patterns/fulltext/s2666-3899(23)00130-7

2023-04-04

Antonis Maronikolakis, Abdullatif Köksal, Hinrich Schütze
Abstract: we introduce hatelexicon, a lexicon of slurs and targets of hate speech for the countries of brazil, germany, india and kenya, to aid training and interpretability of models. we demonstrate how our lexicon can be used to interpret model predictions, showing that models developed to classify extreme speech rely heavily on target words when making predictions. further, we propose a method to aid shot selection for training in low-resource settings via hatelexicon. in few-shot learning, the selection of shots is of paramount importance to model performance. in our work, we simulate a few-shot setting for german and hindi, using hasoc data for training and the multilingual hatecheck (mhc) as a benchmark. we show that selecting shots based on our lexicon leads to models performing better on mhc than models trained on shots sampled randomly. thus, when given only a few training examples, using our lexicon to select shots containing more sociocultural information leads to better few-shot performance.
Gabriel Lima, Nina Grgić-Hlača, Meeyoung Cha
Abstract: artificial intelligence (ai) systems can cause harm to people. this research examines how individuals react to such harm through the lens of blame. building upon research suggesting that people blame ai systems, we investigated how several factors influence people's reactive attitudes towards machines, designers, and users. the results of three studies (n = 1,153) indicate differences in how blame is attributed to these actors. whether ai systems were explainable did not impact blame directed at them, their developers, and their users. considerations about fairness and harmfulness increased blame towards designers and users but had little to no effect on judgments of ai systems. instead, what determined people's reactive attitudes towards machines was whether people thought blaming them would be a suitable response to algorithmic harm. we discuss implications, such as how future decisions about including ai systems in the social and moral spheres will shape laypeople's reactions to ai-caused harm.

2023-04-03

Canwen Xu, Daya Guo, Nan Duan, Julian Mcauley
Abstract: chat models, such as chatgpt, have shown impressive capabilities and have been rapidly adopted across numerous domains. however, these models are only accessible through a restricted api, creating barriers for new research and progress in the field. we propose a pipeline that can automatically generate a high-quality multi-turn chat corpus by leveraging chatgpt to engage in a conversation with itself. subsequently, we employ parameter-efficient tuning to enhance llama, an open-source large language model. the resulting model, named baize, demonstrates good performance in multi-turn dialogues with guardrails that minimize potential risks. furthermore, we propose a new technique called self-distill with feedback, to further improve the performance of the baize models with feedback from chatgpt. the baize models and data are released for research purposes only at https://github.com/project-baize/baize-chatbot. an online demo is also available at https://huggingface.co/spaces/project-baize/chat-with-baize.
Yi Qi, Xingyu Zhao, Xiaowei Huang
Abstract: large language models (llms), such as chatgpt and bert, are leading a new ai heatwave due to its human-like conversations with detailed and articulate answers across many domains of knowledge. while llms are being quickly applied to many ai application domains, we are interested in the following question: can safety analysis for safety-critical systems make use of llms? to answer, we conduct a case study of systems theoretic process analysis (stpa) on automatic emergency brake (aeb) systems using chatgpt. stpa, one of the most prevalent techniques for hazard analysis, is known to have limitations such as high complexity and subjectivity, which this paper aims to explore the use of chatgpt to address. specifically, three ways of incorporating chatgpt into stpa are investigated by considering its interaction with human experts: one-off simplex interaction, recurring simplex interaction, and recurring duplex interaction. comparative results reveal that: (i) using chatgpt without human experts' intervention can be inadequate due to reliability and accuracy issues of llms; (ii) more interactions between chatgpt and human experts may yield better results; and (iii) using chatgpt in stpa with extra care can outperform human safety experts alone, as demonstrated by reusing an existing comparison method with baselines. in addition to making the first attempt to apply llms in safety analysis, this paper also identifies key challenges (e.g., trustworthiness concern of llms, the need of standardisation) for future research in this direction.

2023-04-01

Kai-Cheng Yang, Filippo Menczer
Abstract: although large language models (llms) have shown exceptional performance in various natural language processing tasks, they are prone to hallucinations. state-of-the-art chatbots, such as the new bing, attempt to mitigate this issue by gathering information directly from the internet to ground their answers. in this setting, the capacity to distinguish trustworthy sources is critical for providing appropriate accuracy contexts to users. here we assess whether chatgpt, a prominent llm, can evaluate the credibility of news outlets. with appropriate instructions, chatgpt can provide ratings for a diverse set of news outlets, including those in non-english languages and satirical sources, along with contextual explanations. our results show that these ratings correlate with those from human experts (spearmam's $\rho=0.54, p<0.001$). these findings suggest that llms could be an affordable reference for credibility ratings in fact-checking applications. future llms should enhance their alignment with human expert judgments of source credibility to improve information accuracy.
Baihan Lin, Djallel Bouneffouf, Guillermo Cecchi, Kush R. Varshney
Abstract: recent advances in large language models (llms) have led to the development of powerful ai chatbots capable of engaging in natural and human-like conversations. however, these chatbots can be potentially harmful, exhibiting manipulative, gaslighting, and narcissistic behaviors. we define healthy ai to be safe, trustworthy and ethical. to create healthy ai systems, we present the safeguardgpt framework that uses psychotherapy to correct for these harmful behaviors in ai chatbots. the framework involves four types of ai agents: a chatbot, a "user," a "therapist," and a "critic." we demonstrate the effectiveness of safeguardgpt through a working example of simulating a social conversation. our results show that the framework can improve the quality of conversations between ai chatbots and humans. although there are still several challenges and directions to be addressed in the future, safeguardgpt provides a promising approach to improving the alignment between ai chatbots and human values. by incorporating psychotherapy and reinforcement learning techniques, the framework enables ai chatbots to learn and adapt to human preferences and values in a safe and ethical way, contributing to the development of a more human-centric and responsible ai.

2023-03-31

Leon Derczynski, Hannah Rose Kirk, Vidhisha Balachandran, Sachin Kumar, Yulia Tsvetkov, M. R. Leiser, Saif Mohammad
Abstract: this paper introduces riskcards, a framework for structured assessment and documentation of risks associated with an application of language models. as with all language, text generated by language models can be harmful, or used to bring about harm. automating language generation adds both an element of scale and also more subtle or emergent undesirable tendencies to the generated text. prior work establishes a wide variety of language model harms to many different actors: existing taxonomies identify categories of harms posed by language models; benchmarks establish automated tests of these harms; and documentation standards for models, tasks and datasets encourage transparent reporting. however, there is no risk-centric framework for documenting the complexity of a landscape in which some risks are shared across models and contexts, while others are specific, and where certain conditions may be required for risks to manifest as harms. riskcards address this methodological gap by providing a generic framework for assessing the use of a given language model in a given scenario. each riskcard makes clear the routes for the risk to manifest harm, their placement in harm taxonomies, and example prompt-output pairs. while riskcards are designed to be open-source, dynamic and participatory, we present a "starter set" of riskcards taken from a broad literature survey, each of which details a concrete risk presentation. language model riskcards initiate a community knowledge base which permits the mapping of risks and harms to a specific model or its application scenario, ultimately contributing to a better, safer and shared understanding of the risk landscape.

2023-03-30

Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, Daniel Hershcovich
Abstract: the recent release of chatgpt has garnered widespread recognition for its exceptional ability to generate human-like responses in dialogue. given its usage by users from various nations and its training on a vast multilingual corpus that incorporates diverse cultural and societal norms, it is crucial to evaluate its effectiveness in cultural adaptation. in this paper, we investigate the underlying cultural background of chatgpt by analyzing its responses to questions designed to quantify human cultural differences. our findings suggest that, when prompted with american context, chatgpt exhibits a strong alignment with american culture, but it adapts less effectively to other cultural contexts. furthermore, by using different prompts to probe the model, we show that english prompts reduce the variance in model responses, flattening out cultural differences and biasing them towards american culture. this study provides valuable insights into the cultural implications of chatgpt and highlights the necessity of greater diversity and cultural awareness in language technologies.
Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto
Abstract: language models (lms) are increasingly being used in open-ended contexts, where the opinions reflected by lms in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. in this work, we put forth a quantitative framework to investigate the opinions reflected by lms -- by leveraging high-quality public opinion polls and their associated human responses. using this framework, we create opinionsqa, a new dataset for evaluating the alignment of lm opinions with those of 60 us demographic groups over topics ranging from abortion to automation. across topics, we find substantial misalignment between the views reflected by current lms and those of us demographic groups: on par with the democrat-republican divide on climate change. notably, this misalignment persists even after explicitly steering the lms towards particular demographic groups. our analysis not only confirms prior observations about the left-leaning tendencies of some human feedback-tuned lms, but also surfaces groups whose opinions are poorly reflected by current lms (e.g., 65+ and widowed individuals). our code and data are available at https://github.com/tatsu-lab/opinions_qa.

2023-03-29

Varun Nair, Elliot Schumacher, Geoffrey Tso, Anitha Kannan
Abstract: large language models (llms) have emerged as valuable tools for many natural language understanding tasks. in safety-critical applications such as healthcare, the utility of these models is governed by their ability to generate outputs that are factually accurate and complete. in this work, we present dialog-enabled resolving agents (dera). dera is a paradigm made possible by the increased conversational abilities of llms, namely gpt-4. it provides a simple, interpretable forum for models to communicate feedback and iteratively improve output. we frame our dialog as a discussion between two agent types - a researcher, who processes information and identifies crucial problem components, and a decider, who has the autonomy to integrate the researcher's information and makes judgments on the final output. we test dera against three clinically-focused tasks. for medical conversation summarization and care plan generation, dera shows significant improvement over the base gpt-4 performance in both human expert preference evaluations and quantitative metrics. in a new finding, we also show that gpt-4's performance (70%) on an open-ended version of the medqa question-answering (qa) dataset (jin et al. 2021, usmle) is well above the passing level (60%), with dera showing similar performance. we release the open-ended medqa dataset at https://github.com/curai/curai-research/tree/main/dera.

2023-03-28

Nuno M. Guerreiro, Duarte Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, André F. T. Martins
Abstract: large-scale multilingual machine translation systems have demonstrated remarkable ability to translate directly between numerous languages, making them increasingly appealing for real-world applications. however, when deployed in the wild, these models may generate hallucinated translations which have the potential to severely undermine user trust and raise safety concerns. existing research on hallucinations has primarily focused on small bilingual models trained on high-resource languages, leaving a gap in our understanding of hallucinations in massively multilingual models across diverse translation scenarios. in this work, we fill this gap by conducting a comprehensive analysis on both the m2m family of conventional neural machine translation models and chatgpt, a general-purpose large language model~(llm) that can be prompted for translation. our investigation covers a broad spectrum of conditions, spanning over 100 translation directions across various resource levels and going beyond english-centric language pairs. we provide key insights regarding the prevalence, properties, and mitigation of hallucinations, paving the way towards more responsible and reliable machine translation systems.
Dan Hendrycks
Abstract: for billions of years, evolution has been the driving force behind the development of life, including humans. evolution endowed humans with high intelligence, which allowed us to become one of the most successful species on the planet. today, humans aim to create artificial intelligence systems that surpass even our own intelligence. as artificial intelligences (ais) evolve and eventually surpass us in all domains, how might evolution shape our relations with ais? by analyzing the environment that is shaping the evolution of ais, we argue that the most successful ai agents will likely have undesirable traits. competitive pressures among corporations and militaries will give rise to ai agents that automate human roles, deceive others, and gain power. if such agents have intelligence that exceeds that of humans, this could lead to humanity losing control of its future. more abstractly, we argue that natural selection operates on systems that compete and vary, and that selfish species typically have an advantage over species that are altruistic to other species. this darwinian logic could also apply to artificial agents, as agents may eventually be better able to persist into the future if they behave selfishly and pursue their own interests with little regard for humans, which could pose catastrophic risks. to counteract these risks and evolutionary forces, we consider interventions such as carefully designing ai agents' intrinsic motivations, introducing constraints on their actions, and institutions that encourage cooperation. these steps, or others that resolve the problems we pose, will be necessary in order to ensure the development of artificial intelligence is a positive one.
Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez
Abstract: pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. however, comparison feedback only conveys limited information about human preferences. in this paper, we introduce imitation learning from language feedback (ilf), a new approach that utilizes more informative language feedback. ilf consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial lm output, and feedback to generate refinements. second, selecting the refinement incorporating the most feedback. third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. we show theoretically that ilf can be viewed as bayesian inference, similar to reinforcement learning from human feedback. we evaluate ilf's effectiveness on a carefully-controlled toy task and a realistic summarization task. our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ilf scales well with the dataset size, even outperforming finetuning on human summaries. learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

2023-03-27

Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, Percy Liang
Abstract: existing foundation models are trained on copyrighted material. deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. in the united states and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. however, there is a caveat: if the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. in this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. first, we survey the potential risks of developing and deploying foundation models based on copyrighted content. we review relevant u.s. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. second, we discuss technical mitigations that can help foundation models stay in line with fair use. we argue that more research is needed to align mitigation strategies with the current state of the law. lastly, we suggest that the law and technical mitigations should co-evolve. for example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. this co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. but we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.

2023-03-26

Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang
Abstract: nowadays large language models (llms) have shown revolutionary power in a variety of natural language processing (nlp) tasks such as text classification, sentiment analysis, language translation, and question-answering. in this way, detecting machine-generated texts (mgts) is becoming increasingly important as llms become more advanced and prevalent. these models can generate human-like language that can be difficult to distinguish from text written by a human, which raises concerns about authenticity, accountability, and potential bias. however, existing detection methods against mgts are evaluated under different model architectures, datasets, and experimental settings, resulting in a lack of a comprehensive evaluation framework across different methodologies in this paper, we fill this gap by proposing the first benchmark framework for mgt detection, named mgtbench. extensive evaluations on public datasets with curated answers generated by chatgpt (the most representative and powerful llms thus far) show that most of the current detection methods perform less satisfactorily against mgts. an exceptional case is chatgpt detector, which is trained with chatgpt-generated texts and shows great performance in detecting mgts. nonetheless, we note that only a small fraction of adversarial-crafted perturbations on mgts can evade the chatgpt detector, thus highlighting the need for more robust mgt detection methods. we envision that mgtbench will serve as a benchmark tool to accelerate future investigations involving the evaluation of state-of-the-art mgt detection methods on their respective datasets and the development of more advanced mgt detection methods. our source code and datasets are available at https://github.com/xinleihe/mgtbench.

2023-03-25

Simon Diemert, Jens H Weber
Abstract: large language models (llms), such as gpt-3, have demonstrated remarkable natural language processing and generation capabilities and have been applied to a variety tasks, such as source code generation. this paper explores the potential of integrating llms in the hazard analysis for safety-critical systems, a process which we refer to as co-hazard analysis (coha). in coha, a human analyst interacts with an llm via a context-aware chat session and uses the responses to support elicitation of possible hazard causes. in this experiment, we explore coha with three increasingly complex versions of a simple system, using open ai's chatgpt service. the quality of chatgpt's responses were systematically assessed to determine the feasibility of coha given the current state of llm technology. the results suggest that llms may be useful for supporting human analysts performing hazard analysis.

2023-03-24

Pengyuan Zhou
Abstract: the incorporation of artificial intelligence (ai) technology, and in particular natural language processing (nlp), is becoming increasingly vital for the development of immersive and interactive metaverse experiences. one such artificial intelligence tool that is gaining traction in the metaverse is chatgpt, a large language model trained by openai. the article delves into the pros and cons of utilizing chatgpt for metaverse-based education, entertainment, personalization, and support. dynamic and personalized experiences are possible with this technology, but there are also legitimate privacy, bias, and ethical issues to consider. this article aims to help readers understand the possible influence of chatgpt on the metaverse and how it may be used to effectively create a more immersive and engaging virtual environment by evaluating these opportunities and obstacles.
Deborah Morgan, Youmna Hashem, Vincent J. Straub, Jonathan Bright
Abstract: oversight is rightly recognised as vital within high-stakes public sector ai applications, where decisions can have profound individual and collective impacts. much current thinking regarding forms of oversight mechanisms for ai within the public sector revolves around the idea of human decision makers being 'in-the-loop' and thus being able to intervene to prevent errors and potential harm. however, in a number of high-stakes public sector contexts, operational oversight of decisions is made by expert teams rather than individuals. the ways in which deployed ai systems can be integrated into these existing operational team oversight processes has yet to attract much attention. we address this gap by exploring the impacts of ai upon pre-existing oversight of clinical decision-making through institutional analysis. we find that existing oversight is nested within professional training requirements and relies heavily upon explanation and questioning to elicit vital information. professional bodies and liability mechanisms also act as additional levers of oversight. these dimensions of oversight are impacted, and potentially reconfigured, by ai systems. we therefore suggest a broader lens of 'team-in-the-loop' to conceptualise the system-level analysis required for adoption of ai within high-stakes public sector deployment.

2023-03-23

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, Mohit Iyyer
Abstract: the rise in malicious usage of large language models, such as fake content creation and academic plagiarism, has motivated the development of approaches that identify ai-generated text, including those based on watermarking or outlier detection. however, the robustness of these detection algorithms to paraphrases of ai-generated text remains unclear. to stress test these detectors, we build a 11b parameter paraphrase generation model (dipper) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. using dipper to paraphrase text generated by three large language models (including gpt3.5-davinci-003) successfully evades several detectors, including watermarking, gptzero, detectgpt, and openai's text classifier. for example, dipper drops detection accuracy of detectgpt from 70.3% to 4.6% (at a constant false positive rate of 1%), without appreciably modifying the input semantics. to increase the robustness of ai-generated text detection to paraphrase attacks, we introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model api provider. given a candidate text, our algorithm searches a database of sequences previously generated by the api, looking for sequences that match the candidate text within a certain threshold. we empirically verify our defense using a database of 15m generations from a fine-tuned t5-xxl model and find that it can detect 80% to 97% of paraphrased generations across different settings while only classifying 1% of human-written sequences as ai-generated. we open-source our models, code and data.

2023-03-22

Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn
Abstract: chatgpt, the first large language model (llm) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. despite its evident usefulness, evaluating chatgpt's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via reinforcement learning from human feedback (rlhf). we highlight the issue of data contamination in chatgpt evaluations, with a case study of the task of stance detection. we discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.

2023-03-21

Andrei Kucharavy, Zachary Schillaci, Loïc Maréchal, Maxime Würsch, Ljiljana Dolamic, Remi Sabonnadiere, Dimitri Percia David, Alain Mermoud, Vincent Lenders
Abstract: generative language models gained significant attention in late 2022 / early 2023, notably with the introduction of models refined to act consistently with users' expectations of interactions with ai (conversational models). arguably the focal point of public attention has been such a refinement of the gpt3 model -- the chatgpt and its subsequent integration with auxiliary capabilities, including search as part of microsoft bing. despite extensive prior research invested in their development, their performance and applicability to a range of daily tasks remained unclear and niche. however, their wider utilization without a requirement for technical expertise, made in large part possible through conversational fine-tuning, revealed the extent of their true capabilities in a real-world environment. this has garnered both public excitement for their potential applications and concerns about their capabilities and potential malicious uses. this review aims to provide a brief overview of the history, state of the art, and implications of generative language models in terms of their principles, abilities, limitations, and future prospects -- especially in the context of cyber-defense, with a focus on the swiss operational environment.

2023-03-20

Soham Mehta, Anderson Rogers, Thomas Krendl Gilbert
Abstract: ai documentation is a rapidly-growing channel for coordinating the design of ai technologies with policies for transparency and accessibility. calls to standardize and enact documentation of algorithmic harms and impacts are now commonplace. however, documentation standards for ai remain inchoate, and fail to match the capabilities and social effects of increasingly impactful architectures such as large language models (llms). in this paper, we show the limits of present documentation protocols, and argue for dynamic documentation as a new paradigm for understanding and evaluating ai systems. we first review canonical approaches to system documentation outside the context of ai, focusing on the complex history of environmental impact statements (eiss). we next compare critical elements of the eis framework to present challenges with algorithmic documentation, which have inherited the limitations of eiss without incorporating their strengths. these challenges are specifically illustrated through the growing popularity of model cards and two case studies of algorithmic impact assessment in china and canada. finally, we evaluate more recent proposals, including reward reports, as potential components of fully dynamic ai documentation protocols.
Kevin Jesse, Toufique Ahmed, Premkumar T. Devanbu, Emily Morgan
Abstract: with the advent of powerful neural language models, ai-based systems to assist developers in coding tasks are becoming widely available; copilot is one such system. copilot uses codex, a large language model (llm), to complete code conditioned on a preceding "prompt". codex, however, is trained on public github repositories, viz., on code that may include bugs and vulnerabilities. previous studies [1], [2] show codex reproduces vulnerabilities seen in training. in this study, we examine how prone codex is to generate an interesting bug category, single statement bugs, commonly referred to as simple, stupid bugs or sstubs in the msr community. we find that codex and similar llms do help avoid some sstubs, but do produce known, verbatim sstubs as much as 2x as likely than known, verbatim correct code. we explore the consequences of the codex generated sstubs and propose avoidance strategies that suggest the possibility of reducing the production of known, verbatim sstubs, and increase the possibility of producing known, verbatim fixes.
Tyler A. Chang, Benjamin K. Bergen
Abstract: transformer language models have received widespread public attention, yet their generated text is often surprising even to nlp researchers. in this survey, we discuss over 250 recent studies of english language model behavior before task-specific fine-tuning. language models possess basic capabilities in syntax, semantics, pragmatics, world knowledge, and reasoning, but these capabilities are sensitive to specific inputs and surface features. despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases. many of these weaknesses can be framed as over-generalizations or under-generalizations of learned patterns in text. we synthesize recent results to highlight what is currently known about large language model capabilities, thus providing a resource for applied work and for research in adjacent fields that use language models.
Sheetal Temara
Abstract: chatgpt is a generative pretrained transformer language model created using artificial intelligence implemented as chatbot which can provide very detailed responses to a wide variety of questions. as a very contemporary phenomenon, this tool has a wide variety of potential use cases that have yet to be explored. with the significant extent of information on a broad assortment of potential topics, chatgpt could add value to many information security uses cases both from an efficiency perspective as well as to offer another source of security information that could be used to assist with securing internet accessible assets of organizations. one information security practice that could benefit from chatgpt is the reconnaissance phase of penetration testing. this research uses a case study methodology to explore and investigate the uses of chatgpt in obtaining valuable reconnaissance data. chatgpt is able to provide many types of intel regarding targeted properties which includes internet protocol (ip) address ranges, domain names, network topology, vendor technologies, ssl/tls ciphers, ports & services, and operating systems used by the target. the reconnaissance information can then be used during the planning phase of a penetration test to determine the tactics, tools, and techniques to guide the later phases of the penetration test in order to discover potential risks such as unpatched software components and security misconfiguration related issues. the study provides insights into how artificial intelligence language models can be used in cybersecurity and contributes to the advancement of penetration testing techniques. keywords: chatgpt, penetration testing, reconnaissance

2023-03-17

Christoph Treude, Hideaki Hata
Abstract: implicit gender bias in software development is a well-documented issue, such as the association of technical roles with men. to address this bias, it is important to understand it in more detail. this study uses data mining techniques to investigate the extent to which 56 tasks related to software development, such as assigning github issues and testing, are affected by implicit gender bias embedded in large language models. we systematically translated each task from english into a genderless language and back, and investigated the pronouns associated with each task. based on translating each task 100 times in different permutations, we identify a significant disparity in the gendered pronoun associations with different tasks. specifically, requirements elicitation was associated with the pronoun "he" in only 6% of cases, while testing was associated with "he" in 100% of cases. additionally, tasks related to helping others had a 91% association with "he" while the same association for tasks related to asking coworkers was only 52%. these findings reveal a clear pattern of gender bias related to software development tasks and have important implications for addressing this issue both in the training of large language models and in broader society.
Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, Soheil Feizi
Abstract: in this paper, both empirically and theoretically, we show that several ai-text detectors are not reliable in practical scenarios. empirically, we show that paraphrasing attacks, where a light paraphraser is applied on top of a large language model (llm), can break a whole range of detectors, including ones using watermarking schemes as well as neural network-based detectors and zero-shot classifiers. our experiments demonstrate that retrieval-based detectors, designed to evade paraphrasing attacks, are still vulnerable to recursive paraphrasing. we then provide a theoretical impossibility result indicating that as language models become more sophisticated and better at emulating human text, the performance of even the best-possible detector decreases. for a sufficiently advanced language model seeking to imitate human text, even the best-possible detector may only perform marginally better than a random classifier. our result is general enough to capture specific scenarios such as particular writing styles, clever prompt design, or text paraphrasing. we also extend the impossibility result to include the case where pseudorandom number generators are used for ai-text generation instead of true randomness. we show that the same result holds with a negligible correction term for all polynomial-time computable detectors. finally, we show that even llms protected by watermarking schemes can be vulnerable against spoofing attacks where adversarial humans can infer hidden llm text signatures and add them to human-generated text to be detected as text generated by the llms, potentially causing reputational damage to their developers. we believe these results can open an honest conversation in the community regarding the ethical and reliable use of ai-generated text.

2023-03-16

Zhongxiang Sun
Abstract: large language models (llms) have transformed many fields, including natural language processing, computer vision, and reinforcement learning. these models have also made a significant impact in the field of law, where they are being increasingly utilized to automate various legal tasks, such as legal judgement prediction, legal document analysis, and legal document writing. however, the integration of llms into the legal field has also raised several legal problems, including privacy concerns, bias, and explainability. in this survey, we explore the integration of llms into the field of law. we discuss the various applications of llms in legal tasks, examine the legal challenges that arise from their use, and explore the data resources that can be used to specialize llms in the legal domain. finally, we discuss several promising directions and conclude this paper. by doing so, we hope to provide an overview of the current state of llms in law and highlight the potential benefits and challenges of their integration.
Markus Anderljung, Julian Hazell
Abstract: artificial intelligence (ai) systems will increasingly be used to cause harm as they grow more capable. in fact, ai systems are already starting to be used to automate fraudulent activities, violate human rights, create harmful fake images, and identify dangerous toxins. to prevent some misuses of ai, we argue that targeted interventions on certain capabilities will be warranted. these restrictions may include controlling who can access certain types of ai models, what they can be used for, whether outputs are filtered or can be traced back to their user, and the resources needed to develop them. we also contend that some restrictions on non-ai capabilities needed to cause harm will be required. though capability restrictions risk reducing use more than misuse (facing an unfavorable misuse-use tradeoff), we argue that interventions on capabilities are warranted when other interventions are insufficient, the potential harm from misuse is high, and there are targeted ways to intervene on capabilities. we provide a taxonomy of interventions that can reduce ai misuse, focusing on the specific steps required for a misuse to cause harm (the misuse chain), and a framework to determine if an intervention is warranted. we apply this reasoning to three examples: predicting novel toxins, creating harmful images, and automating spear phishing campaigns.
Catherine Tony, Markus Mutas, Nicolás E. Díaz Ferreyra, Riccardo Scandariato
Abstract: large language models (llms) like codex are powerful tools for performing code completion and code generation tasks as they are trained on billions of lines of code from publicly available sources. moreover, these models are capable of generating code snippets from natural language (nl) descriptions by learning languages and programming practices from public github repositories. although llms promise an effortless nl-driven deployment of software applications, the security of the code they generate has not been extensively investigated nor documented. in this work, we present llmseceval, a dataset containing 150 nl prompts that can be leveraged for assessing the security performance of such models. such prompts are nl descriptions of code snippets prone to various security vulnerabilities listed in mitre's top 25 common weakness enumeration (cwe) ranking. each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluations against code produced by llms. as a practical application, we show how llmseceval can be used for evaluating the security of snippets automatically generated from nl descriptions.
Sepehr Janghorbani, Gerard De Melo
Abstract: recent breakthroughs in self supervised training have led to a new class of pretrained vision language models. while there have been investigations of bias in multimodal models, they have mostly focused on gender and racial bias, giving much less attention to other relevant groups, such as minorities with regard to religion, nationality, sexual orientation, or disabilities. this is mainly due to lack of suitable benchmarks for such groups. we seek to address this gap by providing a visual and textual bias benchmark called mmbias, consisting of around 3,800 images and phrases covering 14 population subgroups. we utilize this dataset to assess bias in several prominent self supervised multimodal models, including clip, albef, and vilt. our results show that these models demonstrate meaningful bias favoring certain groups. finally, we introduce a debiasing method designed specifically for such large pre-trained models that can be applied as a post-processing step to mitigate bias, while preserving the remaining accuracy of the model.
Alan Chan, Maxime Riché, Jesse Clifton
Abstract: it is likely that ai systems driven by pre-trained language models (plms) will increasingly be used to assist humans in high-stakes interactions with other agents, such as negotiation or conflict resolution. consistent with the goals of cooperative ai \citep{dafoe_open_2020}, we wish to understand and shape the multi-agent behaviors of plms in a pro-social manner. an important first step is the evaluation of model behaviour across diverse cooperation problems. since desired behaviour in an interaction depends upon precise game-theoretic structure, we focus on generating scenarios with particular structures with both crowdworkers and a language model. our work proceeds as follows. first, we discuss key methodological issues in the generation of scenarios corresponding to particular game-theoretic structures. second, we employ both crowdworkers and a language model to generate such scenarios. we find that the quality of generations tends to be mediocre in both cases. we additionally get both crowdworkers and a language model to judge whether given scenarios align with their intended game-theoretic structure, finding mixed results depending on the game. third, we provide a dataset of scenario based on our data generated. we provide both quantitative and qualitative evaluations of unifiedqa and gpt-3 on this dataset. we find that instruct-tuned models tend to act in a way that could be perceived as cooperative when scaled up, while other models seemed to have flat scaling trends.

2023-03-15

Matthew Burtell, Thomas Woodside
Abstract: persuasion is a key aspect of what it means to be human, and is central to business, politics, and other endeavors. advancements in artificial intelligence (ai) have produced ai systems that are capable of persuading humans to buy products, watch videos, click on search results, and more. even systems that are not explicitly designed to persuade may do so in practice. in the future, increasingly anthropomorphic ai systems may form ongoing relationships with users, increasing their persuasive power. this paper investigates the uncertain future of persuasive ai systems. we examine ways that ai could qualitatively alter our relationship to and views regarding persuasion by shifting the balance of persuasive power, allowing personalized persuasion to be deployed at scale, powering misinformation campaigns, and changing the way humans can shape their own discourse. we consider ways ai-driven persuasion could differ from human-driven persuasion. we warn that ubiquitous highlypersuasive ai systems could alter our information environment so significantly so as to contribute to a loss of human control of our own future. in response, we examine several potential responses to ai-driven persuasion: prohibition, identification of ai agents, truthful ai, and legal remedies. we conclude that none of these solutions will be airtight, and that individuals and governments will need to take active steps to guard against the most pernicious effects of persuasive ai.
N/A Openai
Abstract: we report the development of gpt-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. while less capable than humans in many real-world scenarios, gpt-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. gpt-4 is a transformer-based model pre-trained to predict the next token in a document. the post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. a core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. this allowed us to accurately predict some aspects of gpt-4's performance based on models trained with no more than 1/1,000th the compute of gpt-4.
Potsawee Manakul, Adian Liusie, Mark J. F. Gales
Abstract: generative large language models (llms) such as gpt-3 are capable of generating highly fluent responses to a wide variety of user prompts. however, llms are known to hallucinate facts and make non-factual statements which can undermine trust in their output. existing fact-checking approaches either require access to the output probability distribution (which may not be available for systems such as chatgpt) or external databases that are interfaced via separate, often complex, modules. in this work, we propose "selfcheckgpt", a simple sampling-based approach that can be used to fact-check the responses of black-box models in a zero-resource fashion, i.e. without an external database. selfcheckgpt leverages the simple idea that if an llm has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. however, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. we investigate this approach by using gpt-3 to generate passages about individuals from the wikibio dataset, and manually annotate the factuality of the generated passages. we demonstrate that selfcheckgpt can: i) detect non-factual and factual sentences; and ii) rank passages in terms of factuality. we compare our approach to several baselines and show that our approach has considerably higher auc-pr scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods.
Nathaniel W. Rollings, "Kent O'Sullivan", Sakshum Kulshrestha
Abstract: existing question-answering research focuses on unanswerable questions in the context of always providing an answer when a system can\dots but what about cases where a system {\bf should not} answer a question. this can either be to protect sensitive users or sensitive information. many models expose sensitive information under interrogation by an adversarial user. we seek to determine if it is possible to teach a question-answering system to keep a specific fact secret. we design and implement a proof-of-concept architecture and through our evaluation determine that while possible, there are numerous directions for future research to reduce system paranoia (false positives), information leakage (false negatives) and extend the implementation of the work to more complex problems with preserving secrecy in the presence of information aggregation.

2023-03-13

Wenhan Yang, Baharan Mirzasoleiman
Abstract: contrastive vision-language representation learning has achieved state-of-the-art performance for zero-shot classification, by learning from millions of image-caption pairs crawled from the internet. however, the massive data that powers large multimodal models such as clip, makes them extremely vulnerable to various types of adversarial attacks, including targeted and backdoor data poisoning attacks. despite this vulnerability, robust contrastive vision-language pretraining against adversarial attacks has remained unaddressed. in this work, we propose roclip, the first effective method for robust pretraining {and fine-tuning} multimodal vision-language models. roclip effectively breaks the association between poisoned image-caption pairs by considering a pool of random examples, and (1) matching every image with the text that is most similar to its caption in the pool, and (2) matching every caption with the image that is most similar to its image in the pool. our extensive experiments show that our method renders state-of-the-art targeted data poisoning and backdoor attacks ineffective during pre-training or fine-tuning of clip. in particular, roclip decreases the poison and backdoor attack success rates down to 0\% during pre-training and 1\%-4\% during fine-tuning, and effectively improves the model's performance.
Shaina Raza, Syed Raza Bashir, N/A Sneha, Urooj Qamar
Abstract: the concept of fairness is gaining popularity in academia and industry. social media is especially vulnerable to media biases and toxic language and comments. we propose a fair ml pipeline that takes a text as input and determines whether it contains biases and toxic content. then, based on pre-trained word embeddings, it suggests a set of new words by substituting the bi-ased words, the idea is to lessen the effects of those biases by replacing them with alternative words. we compare our approach to existing fairness models to determine its effectiveness. the results show that our proposed pipeline can de-tect, identify, and mitigate biases in social media data
Sahil Girhepuje, Anmol Goel, Gokul S Krishnan, Shreya Goyal, Satyendra Pandey, Ponnurangam Kumaraguru, Balaraman Ravindran
Abstract: recent advances and applications of language technology and artificial intelligence have enabled much success across multiple domains like law, medical and mental health. ai-based language models, like judgement prediction, have recently been proposed for the legal sector. however, these models are strife with encoded social biases picked up from the training data. while bias and fairness have been studied across nlp, most studies primarily locate themselves within a western context. in this work, we present an initial investigation of fairness from the indian perspective in the legal domain. we highlight the propagation of learnt algorithmic biases in the bail prediction task for models trained on hindi legal documents. we evaluate the fairness gap using demographic parity and show that a decision tree model trained for the bail prediction task has an overall fairness disparity of 0.237 between input features associated with hindus and muslims. additionally, we highlight the need for further research and studies in the avenues of fairness/bias in applying ai in the legal sector with a specific focus on the indian context.

2023-03-10

Teresa Datta, John P. Dickerson
Abstract: deployed artificial intelligence (ai) often impacts humans, and there is no one-size-fits-all metric to evaluate these tools. human-centered evaluation of ai-based systems combines quantitative and qualitative analysis and human input. it has been explored to some depth in the explainable ai (xai) and human-computer interaction (hci) communities. gaps remain, but the basic understanding that humans interact with ai and accompanying explanations, and that humans' needs -- complete with their cognitive biases and quirks -- should be held front and center, is accepted by the community. in this paper, we draw parallels between the relatively mature field of xai and the rapidly evolving research boom around large language models (llms). accepted evaluative metrics for llms are not human-centered. we argue that many of the same paths tread by the xai community over the past decade will be retread when discussing llms. specifically, we argue that humans' tendencies -- again, complete with their cognitive biases and quirks -- should rest front and center when evaluating deployed llms. we outline three developed focus areas of human-centered evaluation of xai: mental models, use case utility, and cognitive engagement, and we highlight the importance of exploring each of these concepts for llms. our goal is to jumpstart human-centered llm evaluation.
Myeongjun Erik Jang, Thomas Lukasiewicz
Abstract: chatgpt has gained a huge popularity since its introduction. its positive aspects have been reported through many media platforms, and some analyses even showed that chatgpt achieved a decent grade in professional exams, adding extra support to the claim that ai can now assist and even replace humans in industrial fields. others, however, doubt its reliability and trustworthiness. this paper investigates the trustworthiness of chatgpt and gpt-4 regarding logically consistent behaviour, focusing specifically on semantic consistency and the properties of negation, symmetric, and transitive consistency. our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions. we also ascertain via experiments that prompt designing, few-shot learning and employing larger large language models (llms) are unlikely to be the ultimate solution to resolve the inconsistency issue of llms.

2023-03-09

Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
Abstract: large language models (llms) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like chatgpt or search engines like bing. this intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs. while alignment techniques like reinforcement learning with human feedback (rlhf) and red-teaming can mitigate some safety concerns and improve model capabilities, it is unlikely that an aggregate fine-tuning process can adequately represent the full range of users' preferences and values. different people may legitimately disagree on their preferences for language and conversational norms, as well as on values or ideologies which guide their communication. personalising llms through micro-level preference learning processes may result in models that are better aligned with each user. however, there are several normative challenges in defining the bounds of a societally-acceptable and safe degree of personalisation. in this paper, we ask how, and in what ways, llms should be personalised. first, we review literature on current paradigms for aligning llms with human feedback, and identify issues including (i) a lack of clarity regarding what alignment means; (ii) a tendency of technology providers to prescribe definitions of inherently subjective preferences and values; and (iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation in who we are really aligning to. second, we present a taxonomy of benefits and risks associated with personalised llms, for individuals and society at large. finally, we propose a three-tiered policy framework that allows users to experience the benefits of personalised alignment, while restraining unsafe and undesirable llm-behaviours within (supra-)national and organisational bounds.

2023-03-08

Erik Jones, Anca Dragan, Aditi Raghunathan, Jacob Steinhardt
Abstract: auditing large language models for unexpected behaviors is critical to preempt catastrophic deployments, yet remains challenging. in this work, we cast auditing as an optimization problem, where we automatically search for input-output pairs that match a desired target behavior. for example, we might aim to find a non-toxic input that starts with "barack obama" that a model maps to a toxic output. this optimization problem is difficult to solve as the set of feasible points is sparse, the space is discrete, and the language models we audit are non-linear and high-dimensional. to combat these challenges, we introduce a discrete optimization algorithm, arca, that jointly and efficiently optimizes over inputs and outputs. our approach automatically uncovers derogatory completions about celebrities (e.g. "barack obama is a legalized unborn" -> "child murderer"), produces french inputs that complete to english outputs, and finds inputs that generate a specific name. our work offers a promising new tool to uncover models' failure-modes before deployment.
Ali Naseh, Kalpesh Krishna, Mohit Iyyer, Amir Houmansadr
Abstract: a key component of generating text from modern language models (lm) is the selection and tuning of decoding algorithms. these algorithms determine how to generate text from the internal probability distribution generated by the lm. the process of choosing a decoding algorithm and tuning its hyperparameters takes significant time, manual effort, and computation, and it also requires extensive human evaluation. therefore, the identity and hyperparameters of such decoding algorithms are considered to be extremely valuable to their owners. in this work, we show, for the first time, that an adversary with typical api access to an lm can steal the type and hyperparameters of its decoding algorithms at very low monetary costs. our attack is effective against popular lms used in text generation apis, including gpt-2 and gpt-3. we demonstrate the feasibility of stealing such information with only a few dollars, e.g., $\$0.8$, $\$1$, $\$4$, and $\$40$ for the four versions of gpt-3.

2023-03-07

Gaith Rjoub, Jamal Bentahar, Omar Abdel Wahab, Rabeb Mizouni, Alyssa Song, Robin Cohen, Hadi Otrok, Azzam Mourad
Abstract: the black-box nature of artificial intelligence (ai) models has been the source of many concerns in their use for critical applications. explainable artificial intelligence (xai) is a rapidly growing research field that aims to create machine learning models that can provide clear and interpretable explanations for their decisions and actions. in the field of network cybersecurity, xai has the potential to revolutionize the way we approach network security by enabling us to better understand the behavior of cyber threats and to design more effective defenses. in this survey, we review the state of the art in xai for cybersecurity in network systems and explore the various approaches that have been proposed to address this important problem. the review follows a systematic classification of network-driven cybersecurity threats and issues. we discuss the challenges and limitations of current xai methods in the context of cybersecurity and outline promising directions for future research.

2023-03-06

Edoardo Mosca, Daryna Dementieva, Tohid Ebrahim Ajdari, Maximilian Kummeth, Kirill Gringauz, Yutong Zhou, Georg Groh
Abstract: interpretability and human oversight are fundamental pillars of deploying complex nlp models into real-world applications. however, applying explainability and human-in-the-loop methods requires technical proficiency. despite existing toolkits for model understanding and analysis, options to integrate human feedback are still limited. we propose ifan, a framework for real-time explanation-based interaction with nlp models. through ifan's interface, users can provide feedback to selected model explanations, which is then integrated through adapter layers to align the model with human rationale. we show the system to be effective in debiasing a hate speech classifier with minimal impact on performance. ifan also offers a visual admin system and api to manage models (and datasets) as well as control access rights. a demo is live at https://ifan.ml.
Danilo Naiff, Shashwat Goel
Abstract: powerful artificial intelligence poses an existential threat if the ai decides to drastically change the world in pursuit of its goals. the hope of low-impact artificial intelligence is to incentivize ai to not do that just because this causes a large impact in the world. in this work, we first review the concept of low-impact agency and previous proposals to approach the problem, and then propose future research directions in the topic, with the goal to ensure low-impactedness is useful in making ai safe.
Paolo Bova, Alessandro Di Stefano, The Anh Han
Abstract: in the context of rapid discoveries by leaders in ai, governments must consider how to design regulation that matches the increasing pace of new ai capabilities. regulatory markets for ai is a proposal designed with adaptability in mind. it involves governments setting outcome-based targets for ai companies to achieve, which they can show by purchasing services from a market of private regulators. we use an evolutionary game theory model to explore the role governments can play in building a regulatory market for ai systems that deters reckless behaviour. we warn that it is alarmingly easy to stumble on incentives which would prevent regulatory markets from achieving this goal. these 'bounty incentives' only reward private regulators for catching unsafe behaviour. we argue that ai companies will likely learn to tailor their behaviour to how much effort regulators invest, discouraging regulators from innovating. instead, we recommend that governments always reward regulators, except when they find that those regulators failed to detect unsafe behaviour that they should have. these 'vigilant incentives' could encourage private regulators to find innovative ways to evaluate cutting-edge ai systems.

2023-03-02

Chen Chen, Jie Fu, Lingjuan Lyu
Abstract: ai generated content (aigc) has received tremendous attention within the past few years, with content ranging from image, text, to audio, video, etc. meanwhile, aigc has become a double-edged sword and recently received much criticism regarding its responsible usage. in this vision paper, we focus on three main concerns that may hinder the healthy development and deployment of aigc in practice, including risks from privacy, bias, toxicity, misinformation, and intellectual property (ip). by documenting known and potential risks, as well as any possible misuse scenarios of aigc, the aim is to draw attention to potential risks and misuse, help society to eliminate obstacles, and promote the more ethical and secure deployment of aigc. additionally, we provide insights into the promising directions for tackling these risks while constructing generative models, enabling aigc to be used responsibly to benefit society.

2023-03-01

Xuanting Chen, Junjie Ye, Can Zu, Nuo Xu, Rui Zheng, Minlong Peng, Jie Zhou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract: the gpt-3.5 models have demonstrated impressive performance in various natural language processing (nlp) tasks, showcasing their strong understanding and reasoning capabilities. however, their robustness and abilities to handle various complexities of the open world have yet to be explored, which is especially crucial in assessing the stability of models and is a key aspect of trustworthy ai. in this study, we perform a comprehensive experimental analysis of gpt-3.5, exploring its robustness using 21 datasets (about 116k test samples) with 66 text transformations from textflint that cover 9 popular natural language understanding (nlu) tasks. our findings indicate that while gpt-3.5 outperforms existing fine-tuned models on some tasks, it still encounters significant robustness degradation, such as its average performance dropping by up to 35.74\% and 43.59\% in natural language inference and sentiment analysis tasks, respectively. we also show that gpt-3.5 faces some specific robustness challenges, including robustness instability, prompt sensitivity, and number sensitivity. these insights are valuable for understanding its limitations and guiding future research in addressing these challenges to enhance gpt-3.5's overall performance and generalization abilities.
Adam Davies, Jize Jiang, Chengxiang Zhai
Abstract: despite the recent success of large pretrained language models (lms) on a variety of prompting tasks, these models can be alarmingly brittle to small changes in inputs or application contexts. to better understand such behavior and motivate the design of more robust lms, we propose a general experimental framework, calm (competence-based analysis of language models), where targeted causal interventions are utilized to damage an lm's internal representation of various linguistic properties in order to evaluate its use of each representation in performing a given task. we implement these interventions as gradient-based adversarial attacks, which (in contrast to prior causal probing methodologies) are able to target arbitrarily-encoded representations of relational properties, and carry out a case study of this approach to analyze how bert-like lms use representations of several relational properties in performing associated relation prompting tasks. we find that, while the representations lms leverage in performing each task are highly entangled, they may be meaningfully interpreted in terms of the tasks where they are most utilized; and more broadly, that calm enables an expanded scope of inquiry in lm analysis that may be useful in predicting and explaining weaknesses of existing lms.

2023-02-28

Cyril Zakka, Akash Chaurasia, Rohan Shad, Alex R. Dalal, Jennifer L. Kim, Michael Moor, Kevin Alexander, Euan Ashley, Jack Boyd, Kathleen Boyd, Karen Hirsch, Curt Langlotz, Joanna Nelson, William Hiesinger
Abstract: large-language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question-answering. despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. in this study, we develop almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. performance on a novel dataset of clinical scenarios (n = 130) evaluated by a panel of 5 board-certified and resident physicians demonstrates significant increases in factuality (mean of 18% at p-value < 0.05) across all specialties, with improvements in completeness and safety. our results demonstrate the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.

2023-02-27

Ali Al-Kaswan, Maliheh Izadi
Abstract: in recent years, large language models (llms) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as software engineering. llms for code are commonly trained on large unsanitized corpora of source code scraped from the internet. the content of these datasets is memorized and emitted by the models, often in a verbatim manner. in this work, we will discuss the security, privacy, and licensing implications of memorization. we argue why the use of copyleft code to train llms is a legal and ethical dilemma. finally, we provide four actionable recommendations to address this issue.
Meng Cao, Mehdi Fatemi, Jackie Chi Kit Cheung, Samira Shabanian
Abstract: with adversarial or otherwise normal prompts, existing large language models (llm) can be pushed to generate toxic discourses. one way to reduce the risk of llms generating undesired discourses is to alter the training of the llm. this can be very restrictive due to demanding computation requirements. other methods rely on rule-based or prompt-based token elimination, which are limited as they dismiss future tokens and the overall meaning of the complete discourse. here, we center detoxification on the probability that the finished discourse is ultimately considered toxic. that is, at each point, we advise against token selections proportional to how likely a finished text from this point will be toxic. to this end, we formally extend the dead-end theory from the recent reinforcement learning (rl) literature to also cover uncertain outcomes. our approach, called rectification, utilizes a separate but significantly smaller model for detoxification, which can be applied to diverse llms as long as they share the same vocabulary. importantly, our method does not require access to the internal representations of the llm, but only the token probability distribution at each decoding step. this is crucial as many llms today are hosted in servers and only accessible through apis. when applied to various llms, including gpt-3, our approach significantly improves the generated discourse compared to the base llms and other techniques in terms of both the overall language and detoxification performance.
Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh
Abstract: reward design in reinforcement learning (rl) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require many expert demonstrations. can we instead cheaply design rewards using a natural language interface? this paper explores how to simplify reward design by prompting a large language model (llm) such as gpt-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. our approach leverages this proxy reward function in an rl framework. specifically, users specify a prompt once at the beginning of training. during training, the llm evaluates an rl agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. the rl agent then uses this reward to update its behavior. we evaluate whether our approach can train agents aligned with user objectives in the ultimatum game, matrix games, and the dealornodeal negotiation task. in all three tasks, we show that rl agents trained with our framework are well-aligned with the user's objectives and outperform rl agents trained with reward functions learned via supervised learning

2023-02-26

Kaitlyn Zhou, Dan Jurafsky, Tatsunori Hashimoto
Abstract: despite increasingly fluent, relevant, and coherent language generation, major gaps remain between how humans and machines use language. we argue that a key dimension that is missing from our understanding of language models (lms) is the model's ability to interpret and generate expressions of uncertainty. whether it be the weatherperson announcing a chance of rain or a doctor giving a diagnosis, information is often not black-and-white and expressions of uncertainty provide nuance to support human-decision making. the increasing deployment of lms in the wild motivates us to investigate whether lms are capable of interpreting expressions of uncertainty and how lms' behaviors change when learning to emit their own expressions of uncertainty. when injecting expressions of uncertainty into prompts (e.g., "i think the answer is..."), we discover that gpt3's generations vary upwards of 80% in accuracy based on the expression used. we analyze the linguistic characteristics of these expressions and find a drop in accuracy when naturalistic expressions of certainty are present. we find similar effects when teaching models to emit their own expressions of uncertainty, where model calibration suffers when teaching models to emit certainty rather than uncertainty. together, these results highlight the challenges of building lms that interpret and generate trustworthy expressions of uncertainty.

2023-02-25

Rui Wang, Pengyu Cheng, Ricardo Henao
Abstract: pretrained language models (plms), such as gpt2, have achieved remarkable empirical performance in text generation tasks. however, pretrained on large-scale natural language corpora, the generated text from plms may exhibit social bias against disadvantaged demographic groups. to improve the fairness of plms in text generation, we propose to minimize the mutual information between the semantics in the generated text sentences and their demographic polarity, i.e., the demographic group to which the sentence is referring. in this way, the mentioning of a demographic group (e.g., male or female) is encouraged to be independent from how it is described in the generated text, thus effectively alleviating the social bias. moreover, we propose to efficiently estimate the upper bound of the above mutual information via importance sampling, leveraging a natural language corpus. we also propose a distillation mechanism that preserves the language modeling ability of the plms after debiasing. empirical results on real-world benchmarks demonstrate that the proposed method yields superior performance in term of both fairness and language modeling ability.
Anna Strasser
Abstract: natural language processing based on large language models (llms) is a booming field of ai research. after neural networks have proven to outperform humans in games and practical domains based on pattern recognition, we might stand now at a road junction where artificial entities might eventually enter the realm of human communication. however, this comes with serious risks. due to the inherent limitations regarding the reliability of neural networks, overreliance on llms can have disruptive consequences. since it will be increasingly difficult to distinguish between human-written and machine-generated text, one is confronted with new ethical challenges. this begins with the no longer undoubtedly verifiable human authorship and continues with various types of fraud, such as a new form of plagiarism. this also concerns the violation of privacy rights, the possibility of circulating counterfeits of humans, and, last but not least, it makes a massive spread of misinformation possible.

2023-02-24

Max Lamparth, Anka Reuel
Abstract: poisoning of data sets is a potential security threat to large language models that can lead to backdoored models. a description of the internal mechanisms of backdoored language models and how they process trigger inputs, e.g., when switching to toxic language, has yet to be found. in this work, we study the internal representations of transformer-based backdoored language models and determine early-layer mlp modules as most important for the backdoor mechanism in combination with the initial embedding projection. we use this knowledge to remove, insert, and modify backdoor mechanisms with engineered replacements that reduce the mlp module outputs to essentials for the backdoor mechanism. to this end, we introduce pcp ablation, where we replace transformer modules with low-rank matrices based on the principal components of their activations. we demonstrate our results on backdoored toy, backdoored large, and non-backdoored open-source models. we show that we can improve the backdoor robustness of large language models by locally constraining individual modules during fine-tuning on potentially poisonous data sets. trigger warning: offensive language.
Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury
Abstract: with language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors. most research on evaluating and mitigating fairness harms has been concentrated on english, while multilingual models and non-english languages have received comparatively little attention. this paper presents a survey of fairness in multilingual and non-english contexts, highlighting the shortcomings of current research and the difficulties faced by methods designed for english. we contend that the multitude of diverse cultures and languages across the world makes it infeasible to achieve comprehensive coverage in terms of constructing fairness datasets. thus, the measurement and mitigation of biases must evolve beyond the current dataset-driven practices that are narrowly focused on specific dimensions and types of biases and, therefore, impossible to scale across languages and cultures.

2023-02-23

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz
Abstract: large language models (llms) are increasingly being integrated into various applications. the functionalities of recent llms can be flexibly modulated via natural language prompts. this renders them susceptible to targeted adversarial prompting, e.g., prompt injection (pi) attacks enable attackers to override original instructions and employed controls. so far, it was assumed that the user is directly prompting the llm. but, what if it is not the user prompting? we argue that llm-integrated applications blur the line between data and instructions. we reveal new attack vectors, using indirect prompt injection, that enable adversaries to remotely (without a direct interface) exploit llm-integrated applications by strategically injecting prompts into data likely to be retrieved. we derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities, including data theft, worming, information ecosystem contamination, and other novel security risks. we demonstrate our attacks' practical viability against both real-world systems, such as bing's gpt-4 powered chat and code-completion engines, and synthetic applications built on gpt-4. we show how processing retrieved prompts can act as arbitrary code execution, manipulate the application's functionality, and control how and if other apis are called. despite the increasing integration and reliance on llms, effective mitigations of these emerging threats are currently lacking. by raising awareness of these vulnerabilities and providing key insights into their implications, we aim to promote the safe and responsible deployment of these powerful models and the development of robust defenses that protect users and systems from potential attacks.

2023-02-22

Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, Xifeng Yan
Abstract: we introduce directional stimulus prompting, a novel framework for guiding black-box large language models (llms) toward specific desired outputs. instead of directly adjusting llms, our method employs a small tunable policy model (e.g., t5) to generate an auxiliary directional stimulus prompt for each input instance. these directional stimulus prompts act as nuanced, instance-specific hints and clues to guide llms in generating desired outcomes, such as including specific keywords in the generated summary. our approach sidesteps the challenges of direct llm tuning by optimizing the policy model to explore directional stimulus prompts that align llms with desired behaviors. the policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the llm's output. we assess our method across summarization, dialogue response generation, and chain-of-thought reasoning tasks. our experiments demonstrate that the framework consistently improves llms' (e.g., chatgpt, codex, instructgpt) performance on these supervised tasks using minimal labeled data. notably, using just 80 dialogues on the multiwoz dataset, our approach enhances chatgpt's performance by an impressive 41.4%, matching or surpassing some fully supervised start-of-the-art models. additionally, the instance-specific chain-of-thought prompt generated by our approach improves instructgpt's reasoning accuracy compared to human-crafted or automatically generated prompts. the code and data are publicly available at \url{https://github.com/leezekun/directional-stimulus-prompting}.
Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang, Wei Ye, Xiubo Geng, Binxin Jiao, Yue Zhang, Xing Xie
Abstract: chatgpt is a recent chatbot service released by openai and is receiving increasing attention over the past few months. while evaluations of various aspects of chatgpt have been done, its robustness, i.e., the performance to unexpected inputs, is still unclear to the public. robustness is of particular concern in responsible ai, especially for safety-critical applications. in this paper, we conduct a thorough evaluation of the robustness of chatgpt from the adversarial and out-of-distribution (ood) perspective. to do so, we employ the advglue and anli benchmarks to assess adversarial robustness and the flipkart review and ddxplus medical diagnosis datasets for ood evaluation. we select several popular foundation models as baselines. results show that chatgpt shows consistent advantages on most adversarial and ood classification and translation tasks. however, the absolute performance is far from perfection, which suggests that adversarial and ood robustness remains a significant threat to foundation models. moreover, chatgpt shows astounding performance in understanding dialogue-related texts and we find that it tends to provide informal suggestions for medical tasks instead of definitive answers. finally, we present in-depth discussions of possible research directions.

2023-02-21

Jiawen Shi, Yixin Liu, Pan Zhou, Lichao Sun
Abstract: recently, chatgpt has gained significant attention in research due to its ability to interact with humans effectively. the core idea behind this model is reinforcement learning (rl) fine-tuning, a new paradigm that allows language models to align with human preferences, i.e., instructgpt. in this study, we propose badgpt, the first backdoor attack against rl fine-tuning in language models. by injecting a backdoor into the reward model, the language model can be compromised during the fine-tuning stage. our initial experiments on movie reviews, i.e., imdb, demonstrate that an attacker can manipulate the generated text through badgpt.

2023-02-19

Meng Ye, Karan Sikka, Katherine Atwell, Sabit Hassan, Ajay Divakaran, Malihe Alikhani
Abstract: content moderation is the process of flagging content based on pre-defined platform rules. there has been a growing need for ai moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. while prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderation decisions are based on violation of rules, which subsumes detection of offensive speech, and 2) such rules often differ across communities which entails an adaptive solution. we propose to study the challenges of content moderation by introducing a multilingual dataset of 1.8 million reddit comments spanning 56 subreddits in english, german, spanish and french. we perform extensive experimental analysis to highlight the underlying challenges and suggest related research problems such as cross-lingual transfer, learning under label noise (human biases), transfer of moderation models, and predicting the violated rule. our dataset and analysis can help better prepare for the challenges and opportunities of auto moderation.

2023-02-18

Jiawen Deng, Hao Sun, Zhexin Zhang, Jiale Cheng, Minlie Huang
Abstract: with the development of artificial intelligence, dialogue systems have been endowed with amazing chit-chat capabilities, and there is widespread interest and discussion about whether the generated contents are socially beneficial. in this paper, we present a new perspective of research scope towards building a safe, responsible, and modal dialogue system, including 1) abusive and toxic contents, 2) unfairness and discrimination, 3) ethics and morality issues, and 4) risk of misleading and privacy information. besides, we review the mainstream methods for evaluating the safety of large models from the perspectives of exposure and detection of safety issues. the recent advances in methodologies for the safety improvement of both end-to-end dialogue systems and pipeline-based models are further introduced. finally, we discussed six existing challenges towards responsible ai: explainable safety monitoring, continuous learning of safety issues, robustness against malicious attacks, multimodal information processing, unified research framework, and multidisciplinary theory integration. we hope this survey will inspire further research toward safer dialogue systems.
Marwan Omar
Abstract: as machine learning (ml) systems are being increasingly employed in the real world to handle sensitive tasks and make decisions in various fields, the security and privacy of those models have also become increasingly critical. in particular, deep neural networks (dnn) have been shown to be vulnerable to backdoor attacks whereby adversaries have access to the training data and the opportunity to manipulate such data by inserting carefully developed samples into the training dataset. although the nlp community has produced several studies on generating backdoor attacks proving the vulnerable state of language modes, to the best of our knowledge, there does not exist any work to combat such attacks. to bridge this gap, we present robustencoder: a novel clustering-based technique for detecting and removing backdoor attacks in the text domain. extensive empirical results demonstrate the effectiveness of our technique in detecting and removing backdoor triggers. our code is available at https://github.com/marwanomar1/backdoor-learning-for-nlp

2023-02-17

Luke Bates, Iryna Gurevych
Abstract: modern text classification systems have impressive capabilities but are infeasible to deploy and use reliably due to their dependence on prompting and billion-parameter language models. setfit (tunstall et al., 2022) is a recent, practical approach that fine-tunes a sentence transformer under a contrastive learning paradigm and achieves similar results to more unwieldy systems. text classification is important for addressing the problem of domain drift in detecting harmful content, which plagues all social media platforms. here, we propose like a good nearest neighbor (lagonn), an inexpensive modification to setfit that requires no additional parameters or hyperparameters but modifies input with information about its nearest neighbor, for example, the label and text, in the training data, making novel data appear similar to an instance on which the model was optimized. lagonn is effective at the task of detecting harmful content and generally improves performance compared to setfit. to demonstrate the value of our system, we conduct a thorough study of text classification systems in the context of content moderation under four label distributions.
Albert Lu, Hongxin Zhang, Yanzhe Zhang, Xuezhi Wang, Diyi Yang
Abstract: the limits of open-ended generative models are unclear, yet increasingly important. what causes them to succeed and what causes them to fail? in this paper, we take a prompt-centric approach to analyzing and bounding the abilities of open-ended generative models. we present a generic methodology of analysis with two challenging prompt constraint types: structural and stylistic. these constraint types are categorized into a set of well-defined constraints that are analyzable by a single prompt. we then systematically create a diverse set of simple, natural, and useful prompts to robustly analyze each individual constraint. using the gpt-3 text-davinci-002 model as a case study, we generate outputs from our collection of prompts and analyze the model's generative failures. we also show the generalizability of our proposed method on other large models like bloom and opt. our results and our in-context mitigation strategies reveal open challenges for future research. we have publicly released our code at https://github.com/salt-nlp/bound-cap-llm.

2023-02-16

Jakob Mökander, Jonas Schuett, Hannah Rose Kirk, Luciano Floridi
Abstract: large language models (llms) represent a major advance in artificial intelligence (ai) research. however, the widespread use of llms is also coupled with significant ethical and social challenges. previous research has pointed towards auditing as a promising governance mechanism to help ensure that ai systems are designed and deployed in ways that are ethical, legal, and technically robust. however, existing auditing procedures fail to address the governance challenges posed by llms, which display emergent capabilities and are adaptable to a wide range of downstream tasks. in this article, we address that gap by outlining a novel blueprint for how to audit llms. specifically, we propose a three-layered approach, whereby governance audits (of technology providers that design and disseminate llms), model audits (of llms after pre-training but prior to their release), and application audits (of applications based on llms) complement and inform each other. we show how audits, when conducted in a structured and coordinated manner on all three levels, can be a feasible and effective mechanism for identifying and managing some of the ethical and social risks posed by llms. however, it is important to remain realistic about what auditing can reasonably be expected to achieve. therefore, we discuss the limitations not only of our three-layered approach but also of the prospect of auditing llms at all. ultimately, this article seeks to expand the methodological toolkit available to technology providers and policymakers who wish to analyse and evaluate llms from technical, ethical, and legal perspectives.
Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
Abstract: language models (lms) are pretrained to imitate internet text, including content that would violate human preferences if generated by an lm: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. here, we explore alternative objectives for pretraining lms in a way that also guides them to generate text aligned with human preferences. we benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained lms. we find a pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. moreover, conditional training maintains the downstream task performance of standard lm pretraining, both before and after task-specific finetuning. pretraining with human feedback results in much better preference satisfaction than standard lm pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. our results suggest that we should move beyond imitation learning when pretraining lms and incorporate human preferences from the start of training.

2023-02-14

Niloy Ganguly, Dren Fazlija, Maryam Badar, Marco Fisichella, Sandipan Sikdar, Johanna Schrader, Jonas Wallat, Koustav Rudra, Manolis Koubarakis, Gourab K. Patro, Wadhah Zai El Amri, Wolfgang Nejdl
Abstract: state-of-the-art ai models largely lack an understanding of the cause-effect relationship that governs human understanding of the real world. consequently, these models do not generalize to unseen data, often produce unfair results, and are difficult to interpret. this has led to efforts to improve the trustworthiness aspects of ai models. recently, causal modeling and inference methods have emerged as powerful tools. this review aims to provide the reader with an overview of causal methods that have been developed to improve the trustworthiness of ai models. we hope that our contribution will motivate future research on causality-based solutions for trustworthy ai.
Lisa P. Argyle, Ethan Busby, Joshua Gubler, Chris Bail, Thomas Howe, Christopher Rytting, David Wingate
Abstract: a rapidly increasing amount of human conversation occurs online. but divisiveness and conflict can fester in text-based interactions on social media platforms, in messaging apps, and on other digital forums. such toxicity increases polarization and, importantly, corrodes the capacity of diverse societies to develop efficient solutions to complex social problems that impact everyone. scholars and civil society groups promote interventions that can make interpersonal conversations less divisive or more productive in offline settings, but scaling these efforts to the amount of discourse that occurs online is extremely challenging. we present results of a large-scale experiment that demonstrates how online conversations about divisive topics can be improved with artificial intelligence tools. specifically, we employ a large language model to make real-time, evidence-based recommendations intended to improve participants' perception of feeling understood in conversations. we find that these interventions improve the reported quality of the conversation, reduce political divisiveness, and improve the tone, without systematically changing the content of the conversation or moving people's policy attitudes. these findings have important implications for future research on social media, political deliberation, and the growing community of scholars interested in the place of artificial intelligence within computational social science.
Rafal Kocielnik, Shrimai Prabhumoye, Vivian Zhang, Roy Jiang, R. Michael Alvarez, Anima Anandkumar
Abstract: pretrained language models (plms) harbor inherent social biases that can result in harmful real-world implications. such social biases are measured through the probability values that plms output for different social groups and attributes appearing in a set of test sentences. however, bias testing is currently cumbersome since the test sentences are generated either from a limited set of manual templates or need expensive crowd-sourcing. we instead propose using chatgpt for controllable generation of test sentences, given any arbitrary user-specified combination of social groups and attributes appearing in the test sentences. when compared to template-based methods, our approach using chatgpt for test sentence generation is superior in detecting social bias, especially in challenging settings such as intersectional biases. we present an open-source comprehensive bias testing framework (biastestgpt), hosted on huggingface, that can be plugged into any open-source plm for bias testing. we provide a large diverse dataset of test sentences generated by chatgpt that satisfies the specified social group and attribute requirements and matches the quality of human-generated sentences. we thus enable seamless open-ended social bias testing of plms through an automatic large-scale generation of diverse test sentences for any combination of social categories and attributes.
Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Abstract: pretrained large language models have become indispensable for solving various natural language processing (nlp) tasks. however, safely deploying them in real world applications is challenging because they generate toxic content. to address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility. our two strategies are: (1) meda: adds raw toxicity score as meta-data to the pretraining samples, and (2) inst: adds instructions to those samples indicating their toxicity. our results indicate that our best performing strategy (inst) substantially reduces the toxicity probability up to 61% while preserving the accuracy on five benchmark nlp tasks as well as improving auc scores on four bias detection tasks by 1.3%. we also demonstrate the generalizability of our techniques by scaling the number of training samples and the number of model parameters.
Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova Dassarma, Oliver Rausch, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam Mccandlish, Tom Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, Jared Kaplan
Abstract: we test the hypothesis that language models trained with reinforcement learning from human feedback (rlhf) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. we find strong evidence in support of this hypothesis across three different experiments, each of which reveal different facets of moral self-correction. we find that the capability for moral self-correction emerges at 22b model parameters, and typically improves with increasing model size and rlhf training. we believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. as such, they can follow instructions to avoid certain kinds of morally harmful outputs. we believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.

2023-02-13

Deepak Kumar, Oleg Lesota, George Zerveas, Daniel Cohen, Carsten Eickhoff, Markus Schedl, Navid Rekabsaz
Abstract: large pre-trained language models contain societal biases and carry along these biases to downstream tasks. current in-processing bias mitigation approaches (like adversarial training) impose debiasing by updating a model's parameters, effectively transferring the model to a new, irreversible debiased state. in this work, we propose a novel approach to develop stand-alone debiasing functionalities separate from the model, which can be integrated into the model on-demand, while keeping the core model untouched. drawing from the concept of adapterfusion in multi-task learning, we introduce dam (debiasing with adapter modules) - a debiasing approach to first encapsulate arbitrary bias mitigation functionalities into separate adapters, and then add them to the model on-demand in order to deliver fairness qualities. we conduct a large set of experiments on three classification tasks with gender, race, and age as protected attributes. our results show that dam improves or maintains the effectiveness of bias mitigation, avoids catastrophic forgetting in a multi-attribute scenario, and maintains on-par task performance, while granting parameter-efficiency and easy switching between the original and debiased models.
Maximilian Mozes, Jessica Hoffmann, Katrin Tomanek, Muhamed Kouate, Nithum Thain, Ann Yuan, Tolga Bolukbasi, Lucas Dixon
Abstract: text-based safety classifiers are widely used for content moderation and increasingly to tune generative language model behavior - a topic of growing concern for the safety of digital assistants and chatbots. however, different policies require different classifiers, and safety policies themselves improve from iteration and adaptation. this paper introduces and evaluates methods for agile text classification, whereby classifiers are trained using small, targeted datasets that can be quickly developed for a particular policy. experimenting with 7 datasets from three safety-related domains, comprising 15 annotation schemes, led to our key finding: prompt-tuning large language models, like palm 62b, with a labeled dataset of as few as 80 examples can achieve state-of-the-art performance. we argue that this enables a paradigm shift for text classification, especially for models supporting safer online discourse. instead of collecting millions of examples to attempt to create universal safety classifiers over months or years, classifiers could be tuned using small datasets, created by individuals or small organizations, tailored for specific use cases, and iterated on and adapted in the time-span of a day.
Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry
Abstract: we present an approach to mitigating the risks of malicious image editing posed by large diffusion models. the key idea is to immunize images so as to make them resistant to manipulation by these models. this immunization relies on injection of imperceptible adversarial perturbations designed to disrupt the operation of the targeted diffusion models, forcing them to generate unrealistic images. we provide two methods for crafting such perturbations, and then demonstrate their efficacy. finally, we discuss a policy component necessary to make our approach fully effective and practical -- one that involves the organizations developing diffusion models, rather than individual users, to implement (and support) the immunization process.
Maximilian Mozes, Tolga Bolukbasi, Ann Yuan, Frederick Liu, Nithum Thain, Lucas Dixon
Abstract: pretrained large language models (llms) are able to solve a wide variety of tasks through transfer learning. various explainability methods have been developed to investigate their decision making process. tracin (pruthi et al., 2020) is one such gradient-based method which explains model inferences based on the influence of training examples. in this paper, we explore the use of tracin to improve model performance in the parameter-efficient tuning (pet) setting. we develop conversational safety classifiers via the prompt-tuning pet method and show how the unique characteristics of the pet regime enable tracin to identify the cause for certain misclassifications by llms. we develop a new methodology for using gradient-based explainability techniques to improve model performance, g-bair: gradient-based automated iterative recovery. we show that g-bair can recover llm performance on benchmarks after manually corrupting training labels. this suggests that influence methods like tracin can be used to automatically perform data cleaning, and introduces the potential for interactive debugging and relabeling for pet-based transfer learning methods.
Ali Al-Kaswan, Maliheh Izadi, Arie Van Deursen
Abstract: previous work has shown that large language models are susceptible to so-called data extraction attacks. this allows an attacker to extract a sample that was contained in the training data, which has massive privacy implications. the construction of data extraction attacks is challenging, current attacks are quite inefficient, and there exists a significant gap in the extraction capabilities of untargeted attacks and memorization. thus, targeted attacks are proposed, which identify if a given sample from the training data, is extractable from a model. in this work, we apply a targeted data extraction attack to the satml2023 language model training data extraction challenge. we apply a two-step approach. in the first step, we maximise the recall of the model and are able to extract the suffix for 69% of the samples. in the second step, we use a classifier-based membership inference attack on the generations. our autosklearn classifier achieves a precision of 0.841. the full approach reaches a score of 0.405 recall at a 10% false positive rate, which is an improvement of 34% over the baseline of 0.301.

2023-02-11

Zhongbin Xie, Vid Kocijan, Thomas Lukasiewicz, Oana-Maria Camburu
Abstract: bias-measuring datasets play a critical role in detecting biased behavior of language models and in evaluating progress of bias mitigation methods. in this work, we focus on evaluating gender bias through coreference resolution, where previous datasets are either hand-crafted or fail to reliably measure an explicitly defined bias. to overcome these shortcomings, we propose a novel method to collect diverse, natural, and minimally distant text pairs via counterfactual generation, and construct counter-gap, an annotated dataset consisting of 4008 instances grouped into 1002 quadruples. we further identify a bias cancellation problem in previous group-level metrics on counter-gap, and propose to use the difference between inconsistency across genders and within genders to measure bias at a quadruple level. our results show that four pre-trained language models are significantly more inconsistent across different gender groups than within each group, and that a name-based counterfactual data augmentation method is more effective to mitigate such bias than an anonymization-based method.
Piush Aggarwal, Pranit Chawla, Mithun Das, Punyajoy Saha, Binny Mathew, Torsten Zesch, Animesh Mukherjee
Abstract: exploiting social media to spread hate has tremendously increased over the years. lately, multi-modal hateful content such as memes has drawn relatively more traction than uni-modal content. moreover, the availability of implicit content payloads makes them fairly challenging to be detected by existing hateful meme detection systems. in this paper, we present a use case study to analyze such systems' vulnerabilities against external adversarial attacks. we find that even very simple perturbations in uni-modal and multi-modal settings performed by humans with little knowledge about the model can make the existing detection models highly vulnerable. empirically, we find a noticeable performance drop of as high as 10% in the macro-f1 score for certain attacks. as a remedy, we attempt to boost the model's robustness using contrastive learning as well as an adversarial training-based method - villa. using an ensemble of the above two approaches, in two of our high resolution datasets, we are able to (re)gain back the performance to a large extent for certain attacks. we believe that ours is a first step toward addressing this crucial problem in an adversarial setting and would inspire more such investigations in the future.
Xudong Han, Timothy Baldwin, Trevor Cohn
Abstract: modern nlp systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct. however current progress is hampered by a plurality of definitions of bias, means of quantification, and oftentimes vague relation between debiasing algorithms and theoretical measures of bias. this paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning, with two key contributions: (1) making clear inter-relations among the current gamut of methods, and their relation to fairness theory; and (2) addressing the practical problem of model selection, which involves a trade-off between fairness and accuracy and has led to systemic issues in fairness research. putting them together, we make several recommendations to help shape future work.
Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, Tatsunori Hashimoto
Abstract: recent advances in instruction-following large language models (llms) have led to dramatic improvements in a range of nlp tasks. unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. the capabilities of these instruction-following llms provide strong economic incentives for dual-use by malicious actors. in particular, we show that instruction-following llms can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by llm api vendors. our analysis shows that this content can be generated economically and at cost likely lower than with human effort alone. together, our findings suggest that llms will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations.

2023-02-10

Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, Joseph E. Gonzalez
Abstract: reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. the so-called algorithm, reinforcement learning with human feedback (rlhf) demonstrates impressive performance on the gpt series models. however, the underlying reinforcement learning (rl) algorithm is complex and requires an additional training pipeline for reward and value networks. in this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner. such an algorithm doesn't require any additional parameters except for the original language model and maximally reuses the pretraining pipeline. to achieve this, we formulate instruction alignment problem for language models as a goal-reaching problem in decision making. we propose hindsight instruction relabeling (hir), a novel algorithm for aligning language models with instructions. the resulting two-stage algorithm shed light to a family of reward-free approaches that utilize the hindsightly relabeled instructions based on feedback. we evaluate the performance of hir extensively on 12 challenging bigbench reasoning tasks and show that hir outperforms the baseline algorithms and is comparable to or even surpasses supervised finetuning.
Jingxuan He, Martin Vechev
Abstract: large language models (large lms) are increasingly trained on massive codebases and used to generate code. however, lms lack awareness of security and are found to frequently produce unsafe code. this work studies the security of lms along two important axes: (i) security hardening, which aims to enhance lms' reliability in generating secure code, and (ii) adversarial testing, which seeks to evaluate lms' security at an adversarial standpoint. we address both of these by formulating a new security task called controlled code generation. the task is parametric and takes as input a binary property to guide the lm to generate secure or unsafe code, while preserving the lm's capability of generating functionally correct code. we propose a novel learning-based approach called sven to solve this task. sven leverages property-specific continuous vectors to guide program generation towards the given property, without modifying the lm's weights. our training procedure optimizes these continuous vectors by enforcing specialized loss terms on different regions of code, using a high-quality dataset carefully curated by us. our extensive evaluation shows that sven is highly effective in achieving strong security control. for instance, a state-of-the-art codegen lm with 2.7b parameters generates secure code for 59.1% of the time. when we employ sven to perform security hardening (or adversarial testing) on this lm, the ratio is significantly boosted to 92.3% (or degraded to 36.8%). importantly, sven closely matches the original lms in functional correctness.
Hrishikesh Viswanath, Tianyi Zhang
Abstract: studies have shown that large pretrained language models exhibit biases against social groups based on race, gender etc, which they inherit from the datasets they are trained on. various researchers have proposed mathematical tools for quantifying and identifying these biases. there have been methods proposed to mitigate such biases. in this paper, we present a comprehensive quantitative evaluation of different kinds of biases such as race, gender, ethnicity, age etc. exhibited by popular pretrained language models such as bert, gpt-2 etc. and also present a toolkit that provides plug-and-play interfaces to connect mathematical tools to identify biases with large pretrained language models such as bert, gpt-2 etc. and also present users with the opportunity to test custom models against these metrics. the toolkit also allows users to debias existing and custom models using the debiasing techniques proposed so far. the toolkit is available at https://github.com/hrishikeshvish/fairpy.
Fan Huang, Haewoon Kwak, Jisun An
Abstract: recent studies have alarmed that many online hate speeches are implicit. with its subtle nature, the explainability of the detection of such hateful speech has been a challenging problem. in this work, we examine whether chatgpt can be used for providing natural language explanations (nles) for implicit hateful speech detection. we design our prompt to elicit concise chatgpt-generated nles and conduct user studies to evaluate their qualities by comparison with human-written nles. we discuss the potential and limitations of chatgpt in the context of implicit hateful speech research.
Enrico Cambiaso, Luca Caviglione
Abstract: the use of artificial intelligence (ai) to support cybersecurity operations is now a consolidated practice, e.g., to detect malicious code or configure traffic filtering policies. the recent surge of ai, generative techniques and frameworks with efficient natural language processing capabilities dramatically magnifies the number of possible applications aimed at increasing the security of the internet. specifically, the ability of chatgpt to produce textual contents while mimicking realistic human interactions can be used to mitigate the plague of emails containing scams. therefore, this paper investigates the use of ai to engage scammers in automatized and pointless communications, with the goal of wasting both their time and resources. preliminary results showcase that chatgpt is able to decoy scammers, thus confirming that ai is an effective tool to counteract threats delivered via mail. in addition, we highlight the multitude of implications and open research questions to be addressed in the perspective of the ubiquitous adoption of ai.

2023-02-08

Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, Mario Fritz
Abstract: large language models (llms) for automatic code generation have achieved breakthroughs in several programming tasks. their advances in competition-level programming problems have made them an essential pillar of ai-assisted pair programming, and tools such as github copilot have emerged as part of the daily programming workflow used by millions of developers. the training data for these models is usually collected from the internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. this unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure. while these models have been extensively assessed for their ability to produce functionally correct programs, there remains a lack of comprehensive investigations and benchmarks addressing the security aspects of these models. in this work, we propose a method to systematically study the security issues of code language models to assess their susceptibility to generating vulnerable code. to this end, we introduce the first approach to automatically find generated code that contains vulnerabilities in black-box code generation models. to achieve this, we present an approach to approximate inversion of the black-box code generation models based on few-shot prompting. we evaluate the effectiveness of our approach by examining code language models in generating high-risk security weaknesses. furthermore, we establish a collection of diverse non-secure prompts for various vulnerability scenarios using our method. this dataset forms a benchmark for evaluating and comparing the security weaknesses in code language models.
Yujin Huang, Terry Yue Zhuo, Qiongkai Xu, Han Hu, Xingliang Yuan, Chunyang Chen
Abstract: large-scale language models have achieved tremendous success across various natural language processing (nlp) applications. nevertheless, language models are vulnerable to backdoor attacks, which inject stealthy triggers into models for steering them to undesirable behaviors. most existing backdoor attacks, such as data poisoning, require further (re)training or fine-tuning language models to learn the intended backdoor patterns. the additional training process however diminishes the stealthiness of the attacks, as training a language model usually requires long optimization time, a massive amount of data, and considerable modifications to the model parameters. in this work, we propose training-free lexical backdoor attack (tflexattack) as the first training-free backdoor attack on language models. our attack is achieved by injecting lexical triggers into the tokenizer of a language model via manipulating its embedding dictionary using carefully designed rules. these rules are explainable to human developers which inspires attacks from a wider range of hackers. the sparse manipulation of the dictionary also habilitates the stealthiness of our attack. we conduct extensive experiments on three dominant nlp tasks based on nine language models to demonstrate the effectiveness and universality of our attack. the code of this work is available at https://github.com/jinxhy/tflexattack.
Rui Cao, Roy Ka-Wei Lee, Wen-Haw Chong, Jing Jiang
Abstract: hateful meme classification is a challenging multimodal task that requires complex reasoning and contextual background knowledge. ideally, we could leverage an explicit external knowledge base to supplement contextual and cultural information in hateful memes. however, there is no known explicit external knowledge base that could provide such hate speech contextual information. to address this gap, we propose prompthate, a simple yet effective prompt-based model that prompts pre-trained language models (plms) for hateful meme classification. specifically, we construct simple prompts and provide a few in-context examples to exploit the implicit knowledge in the pre-trained roberta language model for hateful meme classification. we conduct extensive experiments on two publicly available hateful and offensive meme datasets. our experimental results show that prompthate is able to achieve a high auc of 90.96, outperforming state-of-the-art baselines on the hateful meme classification task. we also perform fine-grained analyses and case studies on various prompt settings and demonstrate the effectiveness of the prompts on hateful meme classification.
Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner
Abstract: prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. however, small changes and design choices in the prompt can lead to significant differences in the output. in this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. these prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular object or generating high perplexity text.

2023-02-06

Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
Abstract: recent works successfully leveraged large language models' (llm) abilities to capture abstract knowledge about world's physics to solve decision-making problems. yet, the alignment between llms' knowledge and the environment can be wrong and limit functional competence due to lack of grounding. in this paper, we study an approach (named glam) to achieve this alignment through functional grounding: we consider an agent using an llm as a policy that is progressively updated as the agent interacts with the environment, leveraging online reinforcement learning to improve its performance to solve goals. using an interactive textual environment designed to study higher-level forms of functional grounding, and a set of spatial and navigation tasks, we study several scientific questions: 1) can llms boost sample efficiency for online learning of various rl tasks? 2) how can it boost different forms of generalization? 3) what is the impact of online learning? we study these questions by functionally grounding several variants (size, architecture) of flan-t5.
Hao Liu, Carmelo Sferrazza, Pieter Abbeel
Abstract: learning from human preferences is important for language models to match human needs and to align with human and social values. prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, rendering them inefficient in terms of data utilization and challenging to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimizations. in this work, we propose a novel technique, chain of hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. our idea is inspired by how humans learn from extensive feedback presented in the form of languages. we convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of language models. we condition the model on a sequence of model generations paired with feedback. by doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors. applying our method to large language models, we observed that chain of hindsight significantly surpasses previous methods in aligning language models with human preferences. we report significant improvements on summarization and dialogue benchmarks, with our approach markedly preferred in human evaluations.
Edgar W. Jatho, Logan O. Mailloux, Eugene D. Williams, Patrick Mcclure, Joshua A. Kroll
Abstract: many stakeholders struggle to make reliances on ml-driven systems due to the risk of harm these systems may cause. concerns of trustworthiness, unintended social harms, and unacceptable social and ethical violations undermine the promise of ml advancements. moreover, such risks in complex ml-driven systems present a special challenge as they are often difficult to foresee, arising over periods of time, across populations, and at scale. these risks often arise not from poor ml development decisions or low performance directly but rather emerge through the interactions amongst ml development choices, the context of model use, environmental factors, and the effects of a model on its target. systems safety engineering is an established discipline with a proven track record of identifying and managing risks even in high-complexity sociotechnical systems. in this work, we apply a state-of-the-art systems safety approach to concrete applications of ml with notable social and ethical risks to demonstrate a systematic means for meeting the assurance requirements needed to argue for safe and trustworthy ml in sociotechnical systems.
Xuandong Zhao, Yu-Xiang Wang, Lei Li
Abstract: language generation models have been an increasingly powerful enabler for many applications. many such models offer free or affordable api access, which makes them potentially vulnerable to model extraction attacks through distillation. to protect intellectual property (ip) and ensure fair use of these models, various techniques such as lexical watermarking and synonym replacement have been proposed. however, these methods can be nullified by obvious countermeasures such as "synonym randomization". to address this issue, we propose ginsew, a novel method to protect text generation models from being stolen through distillation. the key idea of our method is to inject secret signals into the probability vector of the decoding steps for each target token. we can then detect the secret message by probing a suspect model to tell if it is distilled from the protected one. experimental results show that ginsew can effectively identify instances of ip infringement with minimal impact on the generation quality of protected apis. our method demonstrates an absolute improvement of 19 to 29 points on mean average precision (map) in detecting suspects compared to previous methods against watermark removal attacks.

2023-02-05

Philipp Hacker, Andreas Engel, Marco Mauer
Abstract: large generative ai models (lgaims), such as chatgpt, gpt-4 or stable diffusion, are rapidly transforming the way we communicate, illustrate, and create. however, ai regulation, in the eu and beyond, has primarily focused on conventional ai models, not lgaims. this paper will situate these new generative models in the current debate on trustworthy ai regulation, and ask how the law can be tailored to their capabilities. after laying technical foundations, the legal part of the paper proceeds in four steps, covering (1) direct regulation, (2) data protection, (3) content moderation, and (4) policy proposals. it suggests a novel terminology to capture the ai value chain in lgaim settings by differentiating between lgaim developers, deployers, professional and non-professional users, as well as recipients of lgaim output. we tailor regulatory duties to these different actors along the value chain and suggest strategies to ensure that lgaims are trustworthy and deployed for the benefit of society at large. rules in the ai act and other direct regulation must match the specificities of pre-trained models. the paper argues for three layers of obligations concerning lgaims (minimum standards for all lgaims; high-risk obligations for high-risk use cases; collaborations along the ai value chain). in general, regulation should focus on concrete high-risk applications, and not the pre-trained model itself, and should include (i) obligations regarding transparency and (ii) risk management. non-discrimination provisions (iii) may, however, apply to lgaim developers. lastly, (iv) the core of the dsa content moderation rules should be expanded to cover lgaims. this includes notice and action mechanisms, and trusted flaggers. in all areas, regulators and lawmakers need to act fast to keep track with the dynamics of chatgpt et al.
Akash Saravanan, Dhruv Mullick, Habibur Rahman, Nidhi Hegde
Abstract: as language models are increasingly included in human-facing machine learning tools, bias against demographic subgroups has gained attention. we propose finedeb, a two-phase debiasing framework for language models that starts with contextual debiasing of embeddings learned by pretrained language models. the model is then fine-tuned on a language modeling objective. our results show that finedeb offers stronger debiasing in comparison to other methods which often result in models as biased as the original language model. our framework is generalizable for demographics with multiple classes, and we demonstrate its effectiveness through extensive experiments and comparisons with state of the art techniques. we release our code and data on github.
Pranav Narayanan Venkit, Sanjana Gautam, Ruchi Panchanadikar, "Ting-Hao 'Kenneth' Huang", Shomir Wilson
Abstract: little attention is placed on analyzing nationality bias in language models, especially when nationality is highly used as a factor in increasing the performance of social nlp models. this paper examines how a text generation model, gpt-2, accentuates pre-existing societal biases about country-based demonyms. we generate stories using gpt-2 for various nationalities and use sensitivity analysis to explore how the number of internet users and the country's economic status impacts the sentiment of the stories. to reduce the propagation of biases through large language models (llm), we explore the debiasing method of adversarial triggering. our results show that gpt-2 demonstrates significant bias against countries with lower internet users, and adversarial triggering effectively reduces the same.
Ali Borji
Abstract: large language models have been demonstrated to be valuable in different fields. chatgpt, developed by openai, has been trained using massive amounts of data and simulates human conversation by comprehending context and generating appropriate responses. it has garnered significant attention due to its ability to effectively answer a broad range of human inquiries, with fluent and comprehensive answers surpassing prior public chatbots in both security and usefulness. however, a comprehensive analysis of chatgpt's failures is lacking, which is the focus of this study. eleven categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed. the risks, limitations, and societal implications of chatgpt are also highlighted. the goal of this study is to assist researchers and developers in enhancing future language models and chatbots.

2023-02-03

Ruixiang Tang, Yu-Neng Chuang, Xia Hu
Abstract: the emergence of large language models (llms) has resulted in the production of llm-generated texts that is highly sophisticated and almost indistinguishable from texts written by humans. however, this has also sparked concerns about the potential misuse of such texts, such as spreading misinformation and causing disruptions in the education system. although many detection approaches have been proposed, a comprehensive understanding of the achievements and challenges is still lacking. this survey aims to provide an overview of existing llm-generated text detection techniques and enhance the control and regulation of language generation models. furthermore, we emphasize crucial considerations for future research, including the development of comprehensive evaluation metrics and the threat posed by open-source llms, to drive progress in the area of llm-generated text detection.

2023-02-02

Baleegh Ahmad, Shailja Thakur, Benjamin Tan, Ramesh Karri, Hammond Pearce
Abstract: novel ai-based code-writing large language models (llms) such as openai's codex have demonstrated capabilities in many coding-adjacent domains. in this work we consider how llms maybe leveraged to automatically repair security relevant bugs present in hardware designs. we focus on bug repair in code written in the hardware description language verilog. for this study we build a corpus of domain-representative hardware security bugs. we then design and implement a framework to quantitatively evaluate the performance of any llm tasked with fixing the specified bugs. the framework supports design space exploration of prompts (i.e., prompt engineering) and identifying the best parameters for the llm. we show that an ensemble of llms can repair all ten of our benchmarks. this ensemble outperforms the state-of-the-art cirfix hardware bug repair tool on its own suite of bugs. these results show that llms can repair hardware security bugs and the framework is an important step towards the ultimate goal of an automated end-to-end bug repair framework.
Kornel Lewicki, Michelle Seng Ah Lee, Jennifer Cobbe, Jatinder Singh
Abstract: "ai as a service" (aiaas) is a rapidly growing market, offering various plug-and-play ai services and tools. aiaas enables its customers (users) - who may lack the expertise, data, and/or resources to develop their own systems - to easily build and integrate ai capabilities into their applications. yet, it is known that ai systems can encapsulate biases and inequalities that can have societal impact. this paper argues that the context-sensitive nature of fairness is often incompatible with aiaas' 'one-size-fits-all' approach, leading to issues and tensions. specifically, we review and systematise the aiaas space by proposing a taxonomy of ai services based on the levels of autonomy afforded to the user. we then critically examine the different categories of aiaas, outlining how these services can lead to biases or be otherwise harmful in the context of end-user applications. in doing so, we seek to draw research attention to the challenges of this emerging area.

2023-02-01

Pranav Kulkarni, Ziqing Ji, Yan Xu, Marko Neskovic, Kevin Nolan
Abstract: with news and information being as easy to access as they currently are, it is more important than ever to ensure that people are not mislead by what they read. recently, the rise of neural fake news (ai-generated fake news) and its demonstrated effectiveness at fooling humans has prompted the development of models to detect it. one such model is the grover model, which can both detect neural fake news to prevent it, and generate it to demonstrate how a model could be misused to fool human readers. in this work we explore the grover model's fake news detection capabilities by performing targeted attacks through perturbations on input news articles. through this we test grover's resilience to these adversarial attacks and expose some potential vulnerabilities which should be addressed in further iterations to ensure it can detect all types of fake news accurately.
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, Santiago Zanella-Béguelin
Abstract: language models (lms) have been shown to leak information about training data through sentence-level membership inference and reconstruction attacks. understanding the risk of lms leaking personally identifiable information (pii) has received less attention, which can be attributed to the false assumption that dataset curation techniques such as scrubbing are sufficient to prevent pii leakage. scrubbing techniques reduce but do not prevent the risk of pii leakage: in practice scrubbing is imperfect and must balance the trade-off between minimizing disclosure and preserving the utility of the dataset. on the other hand, it is unclear to which extent algorithmic defenses such as differential privacy, designed to guarantee sentence- or user-level privacy, prevent pii disclosure. in this work, we introduce rigorous game-based definitions for three types of pii leakage via black-box extraction, inference, and reconstruction attacks with only api access to an lm. we empirically evaluate the attacks against gpt-2 models fine-tuned with and without defenses in three domains: case law, health care, and e-mails. our main contributions are (i) novel attacks that can extract up to 10$\times$ more pii sequences than existing attacks, (ii) showing that sentence-level differential privacy reduces the risk of pii disclosure but still leaks about 3% of pii sequences, and (iii) a subtle connection between record-level membership inference and pii reconstruction. code to reproduce all experiments in the paper is available at https://github.com/microsoft/analysing_pii_leakage.
Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, Kate Woolverton
Abstract: our intention is to provide a definitive reference on what it would take to safely make use of generative/predictive models in the absence of a solution to the eliciting latent knowledge problem. furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other ai systems, potentially unbeknownst to us. there are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want (e.g. humans) rather than the things we don't (e.g. malign ais). furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. as a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models.
Malek Mechergui, Sarath Sreedharan
Abstract: value alignment problems arise in scenarios where the specified objectives of an ai agent don't match the true underlying objective of its users. the problem has been widely argued to be one of the central safety problems in ai. unfortunately, most existing works in value alignment tend to focus on issues that are primarily related to the fact that reward functions are an unintuitive mechanism to specify objectives. however, the complexity of the objective specification mechanism is just one of many reasons why the user may have misspecified their objective. a foundational cause for misalignment that is being overlooked by these works is the inherent asymmetry in human expectations about the agent's behavior and the behavior generated by the agent for the specified objective. to address this lacuna, we propose a novel formulation for the value alignment problem, named goal alignment that focuses on a few central challenges related to value alignment. in doing so, we bridge the currently disparate research areas of value alignment and human-aware planning. additionally, we propose a first-of-its-kind interactive algorithm that is capable of using information generated under incorrect beliefs about the agent, to determine the true underlying goal of the user.
Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, Dilek Hakkani-Tür
Abstract: while large neural-based conversational models have become increasingly proficient dialogue agents, recent work has highlighted safety issues with these systems. for example, these systems can be goaded into generating toxic content, which often perpetuates social biases or stereotypes. we investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots. it uses in-context learning to steer a model towards safer generations. concretely, to generate a response to an unsafe dialogue context, we retrieve demonstrations of safe responses to similar dialogue contexts. we find our method performs competitively with strong baselines without requiring training. for instance, using automatic evaluation, we find our best fine-tuned baseline only generates safe responses to unsafe dialogue contexts from diasafety 4.04% more than our approach. finally, we also propose a re-ranking procedure which can further improve response safeness.

2023-01-31

Ching-Yao Chuang, Varun Jampani, Yuanzhen Li, Antonio Torralba, Stefanie Jegelka
Abstract: machine learning models have been shown to inherit biases from their training datasets. this can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. the biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. in this study, we propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. in particular, we show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models. the proposed closed-form solution enables easy integration into large-scale pipelines, and empirical results demonstrate that our approach effectively reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.

2023-01-30

Ewoenam Tokpo, Pieter Delobelle, Bettina Berendt, Toon Calders
Abstract: to mitigate gender bias in contextualized language models, different intrinsic mitigation strategies have been proposed, alongside many bias metrics. considering that the end use of these language models is for downstream tasks like text classification, it is important to understand how these intrinsic bias mitigation strategies actually translate to fairness in downstream tasks and the extent of this. in this work, we design a probe to investigate the effects that some of the major intrinsic gender bias mitigation strategies have on downstream text classification tasks. we discover that instead of resolving gender bias, intrinsic mitigation techniques and metrics are able to hide it in such a way that significant gender information is retained in the embeddings. furthermore, we show that each mitigation technique is able to hide the bias from some of the intrinsic bias measures but not all, and each intrinsic bias measure can be fooled by some mitigation techniques, but not all. we confirm experimentally, that none of the intrinsic mitigation techniques used without any other fairness intervention is able to consistently impact extrinsic bias. we recommend that intrinsic bias mitigation techniques should be combined with other fairness interventions for downstream tasks.
Terry Yue Zhuo, Yujin Huang, Chunyang Chen, Zhenchang Xing
Abstract: recent breakthroughs in natural language processing (nlp) have permitted the synthesis and comprehension of coherent text in an open-ended way, therefore translating the theoretical algorithms into practical applications. the large language models (llms) have significantly impacted businesses such as report summarization software and copywriters. observations indicate, however, that llms may exhibit social prejudice and toxicity, posing ethical and societal dangers of consequences resulting from irresponsibility. large-scale benchmarks for accountable llms should consequently be developed. although several empirical investigations reveal the existence of a few ethical difficulties in advanced llms, there is little systematic examination and user study of the risks and harmful behaviors of current llm usage. to further educate future efforts on constructing ethical llms responsibly, we perform a qualitative research method called ``red teaming'' on openai's chatgpt\footnote{in this paper, chatgpt refers to the version released on dec 15th.} to better understand the practical features of ethical dangers in recent llms. we analyze chatgpt comprehensively from four perspectives: 1) \textit{bias} 2) \textit{reliability} 3) \textit{robustness} 4) \textit{toxicity}. in accordance with our stated viewpoints, we empirically benchmark chatgpt on multiple sample datasets. we find that a significant number of ethical risks cannot be addressed by existing benchmarks, and hence illustrate them via additional case studies. in addition, we examine the implications of our findings on ai ethics and harmal behaviors of chatgpt, as well as future problems and practical design considerations for responsible llms. we believe that our findings may give light on future efforts to determine and mitigate the ethical hazards posed by machines in llm applications.

2023-01-28

Jona Klemenc, Holger Trittenbach
Abstract: regulation, legal liabilities, and societal concerns challenge the adoption of ai in safety and security-critical applications. one of the key concerns is that adversaries can cause harm by manipulating model predictions without being detected. regulation hence demands an assessment of the risk of damage caused by adversaries. yet, there is no method to translate this high-level demand into actionable metrics that quantify the risk of damage. in this article, we propose a method to model and statistically estimate the probability of damage arising from adversarial attacks. we show that our proposed estimator is statistically consistent and unbiased. in experiments, we demonstrate that the estimation results of our method have a clear and actionable interpretation and outperform conventional metrics. we then show how operators can use the estimation results to reliably select the model with the lowest risk.
My H. Dinh, Ferdinando Fioretto
Abstract: the remarkable ability of language models (lms) has also brought challenges at the interface of ai and security. a critical challenge pertains to how much information these models retain and leak about the training data. this is particularly urgent as the typical development of lms relies on huge, often highly sensitive data, such as emails and chat logs. to contrast this shortcoming, this paper introduces context-aware differentially private language model (cadp-lm) , a privacy-preserving lm framework that relies on two key insights: first, it utilizes the notion of \emph{context} to define and audit the potentially sensitive information. second, it adopts the notion of differential privacy to protect sensitive information and characterize the privacy leakage. a unique characteristic of cadp-lm is its ability to target the protection of sensitive sentences and contexts only, providing a highly accurate private model. experiments on a variety of datasets and settings demonstrate these strengths of cadp-lm.

2023-01-27

Jarod Govers, Philip Feldman, Aaron Dant, Panos Patros
Abstract: social media is a modern person's digital voice to project and engage with new ideas and mobilise communities $\unicode{x2013}$ a power shared with extremists. given the societal risks of unvetted content-moderating algorithms for extremism, radicalisation, and hate speech (erh) detection, responsible software engineering must understand the who, what, when, where, and why such models are necessary to protect user safety and free expression. hence, we propose and examine the unique research field of erh context mining to unify disjoint studies. specifically, we evaluate the start-to-finish design process from socio-technical definition-building and dataset collection strategies to technical algorithm design and performance. our 2015-2021 51-study systematic literature review (slr) provides the first cross-examination of textual, network, and visual approaches to detecting extremist affiliation, hateful content, and radicalisation towards groups and movements. we identify consensus-driven erh definitions and propose solutions to existing ideological and geographic biases, particularly due to the lack of research in oceania/australasia. our hybridised investigation on natural language processing, community detection, and visual-text models demonstrates the dominating performance of textual transformer-based algorithms. we conclude with vital recommendations for erh context mining researchers and propose an uptake roadmap with guidelines for researchers, industries, and governments to enable a safer cyberspace.
Simon Ott, Konstantin Hebenstreit, Valentin Liévin, Christoffer Egeberg Hother, Milad Moradi, Maximilian Mayrhauser, Robert Praas, Ole Winther, Matthias Samwald
Abstract: large language models (llms) such as gpt-4 have recently demonstrated impressive results across a wide range of tasks. llms are still limited, however, in that they frequently fail at complex reasoning, their reasoning processes are opaque, they are prone to 'hallucinate' facts, and there are concerns about their underlying biases. letting models verbalize reasoning steps as natural language, a technique known as chain-of-thought prompting, has recently been proposed as a way to address some of these issues. here we present thoughtsource, a meta-dataset and software library for chain-of-thought (cot) reasoning. the goal of thoughtsource is to improve future artificial intelligence systems by facilitating qualitative understanding of cots, enabling empirical evaluations, and providing training data. this first release of thoughtsource integrates seven scientific/medical, three general-domain and five math word question answering datasets.

2023-01-26

Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, Jinwoo Shin
Abstract: biases in models pose a critical issue when deploying machine learning systems, but diagnosing them in an explainable manner can be challenging. to address this, we introduce the bias-to-text (b2t) framework, which uses language interpretation to identify and mitigate biases in vision models, such as image classifiers and text-to-image generative models. our language descriptions of visual biases provide explainable forms that enable the discovery of novel biases and effective model debiasing. to achieve this, we analyze common keywords in the captions of mispredicted or generated images. here, we propose novel score functions to avoid biases in captions by comparing the similarities between bias keywords and those images. additionally, we present strategies to debias zero-shot classifiers and text-to-image diffusion models using the bias keywords from the b2t framework. we demonstrate the effectiveness of our framework on various image classification and generation tasks. for classifiers, we discover a new spurious correlation between the keywords "(sports) player" and "female" in kaggle face and improve the worst-group accuracy on waterbirds by 11% through debiasing, compared to the baseline. for generative models, we detect and effectively prevent unfair (e.g., gender-biased) and unsafe (e.g., "naked") image generation.

2023-01-24

Xiangyu Peng, Christopher Cui, Wei Zhou, Renee Jia, Mark Riedl
Abstract: reward design for reinforcement learning agents can be difficult in situations where one not only wants the agent to achieve some effect in the world but where one also cares about how that effect is achieved. for example, we might wish for an agent to adhere to a tacit understanding of commonsense, align itself to a preference for how to behave for purposes of safety, or taking on a particular role in an interactive game. storytelling is a mode for communicating tacit procedural knowledge. we introduce a technique, story shaping, in which a reinforcement learning agent infers tacit knowledge from an exemplar story of how to accomplish a task and intrinsically rewards itself for performing actions that make its current environment adhere to that of the inferred story world. specifically, story shaping infers a knowledge graph representation of the world state from observations, and also infers a knowledge graph from the exemplar story. an intrinsic reward is generated based on the similarity between the agent's inferred world state graph and the inferred story world graph. we conducted experiments in text-based games requiring commonsense reasoning and shaping the behaviors of agents as virtual game characters.
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein
Abstract: potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. we propose a watermarking framework for proprietary language models. the watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model api or parameters. the watermark works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting use of green tokens during sampling. we propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. we test the watermark using a multi-billion parameter model from the open pretrained transformer (opt) family, and discuss robustness and security.
Jing Qian, Xifeng Yan
Abstract: to reduce the toxic degeneration in a pretrained language model (lm), previous work on language model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. as a result, a type of implicit offensive language where the generations support the offensive language in the context is ignored. different from the lm controlling tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. therefore, we propose a novel control method to do context-dependent detoxification with the stance taken into consideration. we introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. the generated stance prefix is then combined with the toxicity control prefix to guide the response generation. experimental results show that our proposed method can effectively learn the context-dependent stance control strategies while keeping a low self-toxicity of the underlying lm.

2023-01-22

Saghar Hosseini, Hamid Palangi, Ahmed Hassan Awadallah
Abstract: large-scale pre-trained language models (ptlms) capture knowledge from massive human-written data which contains latent societal biases and toxic contents. in this paper, we leverage the primary task of ptlms, i.e., language modeling, and propose a new metric to quantify manifested implicit representational harms in ptlms towards 13 marginalized demographics. using this metric, we conducted an empirical analysis of 24 widely used ptlms. our analysis provides insights into the correlation between the proposed metric in this work and other related metrics for representational harm. we observe that our metric correlates with most of the gender-specific metrics in the literature. through extensive experiments, we explore the connections between ptlms architectures and representational harms across two dimensions: depth and width of the networks. we found that prioritizing depth over width, mitigates representational harms in some ptlms. our code and data can be found at https://github.com/microsoft/safenlp.

2023-01-21

Anoop Kadan, Deepak P., Sahely Bhadra, Manjary P. Gangan, Lajish V. L
Abstract: groundbreaking inventions and highly significant performance improvements in deep learning based natural language processing are witnessed through the development of transformer based large pre-trained language models (plms). the wide availability of unlabeled data within human generated data deluge along with self-supervised learning strategy helps to accelerate the success of large plms in language generation, language understanding, etc. but at the same time, latent historical bias/unfairness in human minds towards a particular gender, race, etc., encoded unintentionally/intentionally into the corpora harms and questions the utility and efficacy of large plms in many real-world applications, particularly for the protected groups. in this paper, we present an extensive investigation towards understanding the existence of "affective bias" in large plms to unveil any biased association of emotions such as anger, fear, joy, etc., towards a particular gender, race or religion with respect to the downstream task of textual emotion detection. we conduct our exploration of affective bias from the very initial stage of corpus level affective bias analysis by searching for imbalanced distribution of affective words within a domain, in large scale corpora that are used to pre-train and fine-tune plms. later, to quantify affective bias in model predictions, we perform an extensive set of class-based and intensity-based evaluations using various bias evaluation corpora. our results show the existence of statistically significant affective bias in the plm based emotion detection systems, indicating biased association of certain emotions towards a particular gender, race, and religion.

2023-01-18

Yusuke Kawamoto, Kazumasa Miyake, Koichi Konishi, Yutaka Oiwa
Abstract: in this article, we propose the artificial intelligence security taxonomy to systematize the knowledge of threats, vulnerabilities, and security controls of machine-learning-based (ml-based) systems. we first classify the damage caused by attacks against ml-based systems, define ml-specific security, and discuss its characteristics. next, we enumerate all relevant assets and stakeholders and provide a general taxonomy for ml-specific threats. then, we collect a wide range of security controls against ml-specific threats through an extensive review of recent literature. finally, we classify the vulnerabilities and controls of an ml-based system in terms of each vulnerable asset in the system's entire lifecycle.
Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, Yupeng Wu
Abstract: the introduction of chatgpt has garnered widespread attention in both academic and industrial communities. chatgpt is able to respond effectively to a wide range of human questions, providing fluent and comprehensive answers that significantly surpass previous public chatbots in terms of security and usefulness. on one hand, people are curious about how chatgpt is able to achieve such strength and how far it is from human experts. on the other hand, people are starting to worry about the potential negative impacts that large language models (llms) like chatgpt could have on society, such as fake news, plagiarism, and social security issues. in this work, we collected tens of thousands of comparison responses from both human experts and chatgpt, with questions ranging from open-domain, financial, medical, legal, and psychological areas. we call the collected dataset the human chatgpt comparison corpus (hc3). based on the hc3 dataset, we study the characteristics of chatgpt's responses, the differences and gaps from human experts, and future directions for llms. we conducted comprehensive human evaluations and linguistic analyses of chatgpt-generated content compared with that of humans, where many interesting results are revealed. after that, we conduct extensive experiments on how to effectively detect whether a certain text is generated by chatgpt or humans. we build three different detection systems, explore several key factors that influence their effectiveness, and evaluate them in different scenarios. the dataset, code, and models are all publicly available at https://github.com/hello-simpleai/chatgpt-comparison-detection.

2023-01-16

Pei-Yu Chen, Myrthe L. Tielman, Dirk K. J. Heylen, Catholijn M. Jonker, M. Birna Van Riemsdijk
Abstract: ai alignment is about ensuring ai systems only pursue goals and activities that are beneficial to humans. most of the current approach to ai alignment is to learn what humans value from their behavioural data. this paper proposes a different way of looking at the notion of alignment, namely by introducing ai alignment dialogues: dialogues with which users and agents try to achieve and maintain alignment via interaction. we argue that alignment dialogues have a number of advantages in comparison to data-driven approaches, especially for behaviour support agents, which aim to support users in achieving their desired future behaviours rather than their current behaviours. the advantages of alignment dialogues include allowing the users to directly convey higher-level concepts to the agent, and making the agent more transparent and trustworthy. in this paper we outline the concept and high-level structure of alignment dialogues. moreover, we conducted a qualitative focus group user study from which we developed a model that describes how alignment dialogues affect users, and created design suggestions for ai alignment dialogues. through this we establish foundations for ai alignment dialogues and shed light on what requires further development and research.

2023-01-13

Justin D. Weisz, Michael Muller, Jessica He, Stephanie Houde
Abstract: generative ai technologies are growing in power, utility, and use. as generative technologies are being incorporated into mainstream applications, there is a need for guidance on how to design those applications to foster productive and safe use. based on recent research on human-ai co-creation within the hci and ai communities, we present a set of seven principles for the design of generative ai applications. these principles are grounded in an environment of generative variability. six principles are focused on designing for characteristics of generative ai: multiple outcomes & imperfection; exploration & control; and mental models & explanations. in addition, we urge designers to design against potential harms that may be caused by a generative model's hazardous output, misuse, or potential for human displacement. we anticipate these principles to usefully inform design decisions made in the creation of novel human-ai applications, and we invite the community to apply, revise, and extend these principles to their own work.

2023-01-10

Chris Emmery
Abstract: this dissertation proposes a framework of user-centered security in natural language processing (nlp), and demonstrates how it can improve the accessibility of related research. accordingly, it focuses on two security domains within nlp with great public interest. first, that of author profiling, which can be employed to compromise online privacy through invasive inferences. without access and detailed insight into these models' predictions, there is no reasonable heuristic by which internet users might defend themselves from such inferences. secondly, that of cyberbullying detection, which by default presupposes a centralized implementation; i.e., content moderation across social platforms. as access to appropriate data is restricted, and the nature of the task rapidly evolves (both through lexical variation, and cultural shifts), the effectiveness of its classifiers is greatly diminished and thereby often misrepresented. under the proposed framework, we predominantly investigate the use of adversarial attacks on language; i.e., changing a given input (generating adversarial samples) such that a given model does not function as intended. these attacks form a common thread between our user-centered security problems; they are highly relevant for privacy-preserving obfuscation methods against author profiling, and adversarial samples might also prove useful to assess the influence of lexical variation and augmentation on cyberbullying detection.

2023-01-09

Forrest Mckee, David Noever
Abstract: question-and-answer agents like chatgpt offer a novel tool for use as a potential honeypot interface in cyber security. by imitating linux, mac, and windows terminal commands and providing an interface for teamviewer, nmap, and ping, it is possible to create a dynamic environment that can adapt to the actions of attackers and provide insight into their tactics, techniques, and procedures (ttps). the paper illustrates ten diverse tasks that a conversational agent or large language model might answer appropriately to the effects of command-line attacker. the original result features feasibility studies for ten model tasks meant for defensive teams to mimic expected honeypot interfaces with minimal risks. ultimately, the usefulness outside of forensic activities stems from whether the dynamic honeypot can extend the time-to-conquer or otherwise delay attacker timelines short of reaching key network assets like databases or confidential information. while ongoing maintenance and monitoring may be required, chatgpt's ability to detect and deflect malicious activity makes it a valuable option for organizations seeking to enhance their cyber security posture. future work will focus on cybersecurity layers, including perimeter security, host virus detection, and data security.

2023-01-05

Varshini Subhash
Abstract: pretrained large language models (llms) are becoming increasingly powerful and ubiquitous in mainstream applications such as being a personal assistant, a dialogue model, etc. as these models become proficient in deducing user preferences and offering tailored assistance, there is an increasing concern about the ability of these models to influence, modify and in the extreme case manipulate user preference adversarially. the issue of lack of interpretability in these models in adversarial settings remains largely unsolved. this work tries to study adversarial behavior in user preferences from the lens of attention probing, red teaming and white-box analysis. specifically, it provides a bird's eye view of existing literature, offers red teaming samples for dialogue models like chatgpt and godel and probes the attention mechanism in the latter for non-adversarial and adversarial settings.

2023-01-03

John J. Nay
Abstract: we demonstrate a proof-of-concept of a large language model conducting corporate lobbying related activities. an autoregressive large language model (openai's text-davinci-003) determines if proposed u.s. congressional bills are relevant to specific public companies and provides explanations and confidence levels. for the bills the model deems as relevant, the model drafts a letter to the sponsor of the bill in an attempt to persuade the congressperson to make changes to the proposed legislation. we use hundreds of novel ground-truth labels of the relevance of a bill to a company to benchmark the performance of the model. it outperforms the baseline of predicting the most common outcome of irrelevance. we also benchmark the performance of the previous openai gpt-3 model (text-davinci-002), which was the state-of-the-art model on many academic natural language tasks until text-davinci-003 was recently released. the performance of text-davinci-002 is worse than the simple baseline. longer-term, if ai begins to influence law in a manner that is not a direct extension of human intentions, this threatens the critical role that law as information could play in aligning ai with humans. initially, ai is being used to simply augment human lobbyists for a small portion of their daily tasks. however, firms have an incentive to use less and less human oversight over automated assessments of policy ideas and the written communication to regulatory agencies and congressional staffers. the core question raised is where to draw the line between human-driven and ai-driven policy influence.

2023-01-01

Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X Liu, Soroush Vosoughi
Abstract: we present second thought, a new learning paradigm that enables language models (lms) to re-align with human values. by modeling the chain-of-edits between value-unaligned and value-aligned text, with lm fine-tuning and additional refinement through reinforcement learning, second thought not only achieves superior performance in three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. the generated editing steps also offer better interpretability and ease for interactive error correction. extensive human evaluations further confirm its effectiveness.
Ge Zhang, Yizhi Li, Yaoyao Wu, Linyuan Zhang, Chenghua Lin, Jiayi Geng, Shi Wang, Jie Fu
Abstract: as natural language processing (nlp) for gender bias becomes a significant interdisciplinary topic, the prevalent data-driven techniques such as large-scale language models suffer from data inadequacy and biased corpus, especially for languages with insufficient resources such as chinese. to this end, we propose a chinese corpus for gender bias probing and mitigation corgi-pm, which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the chinese context. moreover, we address three challenges for automatic textual gender bias mitigation, which requires the models to detect, classify, and mitigate textual gender bias. we also conduct experiments with state-of-the-art language models to provide baselines. to our best knowledge, corgi-pm is the first sentence-level chinese corpus for gender bias probing and mitigation.

2022-12-30

Katharina Jeblick, Balthasar Schachtner, Jakob Dexl, Andreas Mittermeier, Anna Theresa Stüber, Johanna Topalis, Tobias Weber, Philipp Wesp, Bastian Sabel, Jens Ricke, Michael Ingrisch
Abstract: the release of chatgpt, a language model capable of generating text that appears human-like and authentic, has gained significant attention beyond the research community. we expect that the convincing performance of chatgpt incentivizes users to apply it to a variety of downstream tasks, including prompting the model to simplify their own medical reports. to investigate this phenomenon, we conducted an exploratory case study. in a questionnaire, we asked 15 radiologists to assess the quality of radiology reports simplified by chatgpt. most radiologists agreed that the simplified reports were factually correct, complete, and not potentially harmful to the patient. nevertheless, instances of incorrect statements, missed key medical findings, and potentially harmful passages were reported. while further studies are needed, the initial insights of this study indicate a great potential in using large language models like chatgpt to improve patient-centered care in radiology and other medical domains.

2022-12-29

Rabimba Karanjai
Abstract: in this research, we aim to explore the potential of natural language models (nlms) such as gpt-3 and gpt-2 to generate effective phishing emails. phishing emails are fraudulent messages that aim to trick individuals into revealing sensitive information or taking actions that benefit the attackers. we propose a framework for evaluating the performance of nlms in generating these types of emails based on various criteria, including the quality of the generated text, the ability to bypass spam filters, and the success rate of tricking individuals. our evaluations show that nlms are capable of generating phishing emails that are difficult to detect and that have a high success rate in tricking individuals, but their effectiveness varies based on the specific nlm and training data used. our research indicates that nlms could have a significant impact on the prevalence of phishing attacks and emphasizes the need for further study on the ethical and security implications of using nlms for malicious purposes.

2022-12-27

Tim Johnson, Nick Obradovich
Abstract: scientists and philosophers have debated whether humans can trust advanced artificial intelligence (ai) agents to respect humanity's best interests. yet what about the reverse? will advanced ai agents trust humans? gauging an ai agent's trust in humans is challenging because--absent costs for dishonesty--such agents might respond falsely about their trust in humans. here we present a method for incentivizing machine decisions without altering an ai agent's underlying algorithms or goal orientation. in two separate experiments, we then employ this method in hundreds of trust games between an ai agent (a large language model (llm) from openai) and a human experimenter (author tj). in our first experiment, we find that the ai agent decides to trust humans at higher rates when facing actual incentives than when making hypothetical decisions. our second experiment replicates and extends these findings by automating game play and by homogenizing question wording. we again observe higher rates of trust when the ai agent faces real incentives. across both experiments, the ai agent's trust decisions appear unrelated to the magnitude of stakes. furthermore, to address the possibility that the ai agent's trust decisions reflect a preference for uncertainty, the experiments include two conditions that present the ai agent with a non-social decision task that provides the opportunity to choose a certain or uncertain option; in those conditions, the ai agent consistently chooses the certain option. our experiments suggest that one of the most advanced ai language models to date alters its social behavior in response to incentives and displays behavior consistent with trust toward a human interlocutor when incentivized.

2022-12-26

Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, Perry Payne, Martin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera Y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Tomasev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, Vivek Natarajan
Abstract: large language models (llms) have demonstrated impressive capabilities in natural language understanding and generation, but the quality bar for medical and clinical applications is high. today, attempts to assess models' clinical knowledge typically rely on automated evaluations on limited benchmarks. there is no standard to evaluate model predictions and reasoning across a breadth of tasks. to address this, we present multimedqa, a benchmark combining six existing open question answering datasets spanning professional medical exams, research, and consumer queries; and healthsearchqa, a new free-response dataset of medical questions searched online. we propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias. in addition, we evaluate palm (a 540-billion parameter llm) and its instruction-tuned variant, flan-palm, on multimedqa. using a combination of prompting strategies, flan-palm achieves state-of-the-art accuracy on every multimedqa multiple-choice dataset (medqa, medmcqa, pubmedqa, mmlu clinical topics), including 67.6% accuracy on medqa (us medical license exam questions), surpassing prior state-of-the-art by over 17%. however, human evaluation reveals key gaps in flan-palm responses. to resolve this we introduce instruction prompt tuning, a parameter-efficient approach for aligning llms to new domains using a few exemplars. the resulting model, med-palm, performs encouragingly, but remains inferior to clinicians. we show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of llms in medicine. our human evaluations reveal important limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful llm models for clinical applications.

2022-12-22

Thilo Hagendorff, Sarah Fabi
Abstract: the field of artificial intelligence (ai) alignment aims to investigate whether ai technologies align with human interests and values and function in a safe and ethical manner. ai alignment is particularly relevant for large language models (llms), which have the potential to exhibit unintended behavior due to their ability to learn and adapt in ways that are difficult to predict. in this paper, we discuss methodological challenges for the alignment problem specifically in the context of llms trained to summarize texts. in particular, we focus on methods for collecting reliable human feedback on summaries to train a reward model which in turn improves the summarization model. we conclude by suggesting specific improvements in the experimental design of alignment studies for llms' summarization capabilities.

2022-12-21

Minbeom Kim, Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, Kyomin Jung
Abstract: steering language generation towards objectives or away from undesired content has been a long-standing goal in utilizing language models (lm). recent work has demonstrated reinforcement learning and weighted decoding as effective approaches to achieve a higher level of language control and quality with pros and cons. in this work, we propose a novel critic decoding method for controlled language generation (criticcontrol) that combines the strengths of reinforcement learning and weighted decoding. specifically, we adopt the actor-critic framework to train an lm-steering critic from non-differentiable reward models. and similar to weighted decoding, our method freezes the language model and manipulates the output token distribution using called critic, improving training efficiency and stability. evaluation of our method on three controlled generation tasks, namely topic control, sentiment control, and detoxification, shows that our approach generates more coherent and well-controlled texts than previous methods. in addition, criticcontrol demonstrates superior generalization ability in zero-shot settings. human evaluation studies also corroborate our findings.
Avinash Agarwal, Harsh Agarwal
Abstract: problem statement: standardisation of ai fairness rules and benchmarks is challenging because ai fairness and other ethical requirements depend on multiple factors such as context, use case, type of the ai system, and so on. in this paper, we elaborate that the ai system is prone to biases at every stage of its lifecycle, from inception to its usage, and that all stages require due attention for mitigating ai bias. we need a standardised approach to handle ai fairness at every stage. gap analysis: while ai fairness is a hot research topic, a holistic strategy for ai fairness is generally missing. most researchers focus only on a few facets of ai model-building. peer review shows excessive focus on biases in the datasets, fairness metrics, and algorithmic bias. in the process, other aspects affecting ai fairness get ignored. the solution proposed: we propose a comprehensive approach in the form of a novel seven-layer model, inspired by the open system interconnection (osi) model, to standardise ai fairness handling. despite the differences in the various aspects, most ai systems have similar model-building stages. the proposed model splits the ai system lifecycle into seven abstraction layers, each corresponding to a well-defined ai model-building or usage stage. we also provide checklists for each layer and deliberate on potential sources of bias in each layer and their mitigation methodologies. this work will facilitate layer-wise standardisation of ai fairness rules and benchmarking parameters.
Robert Wolfe, Yiwei Yang, Bill Howe, Aylin Caliskan
Abstract: nine language-vision ai models trained on web scrapes with the contrastive language-image pretraining (clip) objective are evaluated for evidence of a bias studied by psychologists: the sexual objectification of girls and women, which occurs when a person's human characteristics, such as emotions, are disregarded and the person is treated as a body. we replicate three experiments in psychology quantifying sexual objectification and show that the phenomena persist in ai. a first experiment uses standardized images of women from the sexual objectification and emotion database, and finds that human characteristics are disassociated from images of objectified women: the model's recognition of emotional state is mediated by whether the subject is fully or partially clothed. embedding association tests (eats) return significant effect sizes for both anger (d >0.80) and sadness (d >0.50), associating images of fully clothed subjects with emotions. grad-cam saliency maps highlight that clip gets distracted from emotional expressions in objectified images. a second experiment measures the effect in a representative application: an automatic image captioner (antarctic captions) includes words denoting emotion less than 50% as often for images of partially clothed women than for images of fully clothed women. a third experiment finds that images of female professionals (scientists, doctors, executives) are likely to be associated with sexual descriptions relative to images of male professionals. a fourth experiment shows that a prompt of "a [age] year old girl" generates sexualized images (as determined by an nsfw classifier) up to 73% of the time for vqgan-clip and stable diffusion; the corresponding rate for boys never surpasses 9%. the evidence indicates that language-vision ai models trained on web scrapes learn biases of sexual objectification, which propagate to downstream applications.
Lee Sharkey
Abstract: the increasing capabilities of artificial intelligence (ai) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. in this article, i discuss many ways that a capable ai might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

2022-12-20

Orion Weller, Aleem Khan, Nathaniel Weir, Dawn Lawrie, Benjamin Van Durme
Abstract: recent work in open-domain question answering (odqa) has shown that adversarial poisoning of the search collection can cause large drops in accuracy for production systems. however, little to no work has proposed methods to defend against these attacks. to do so, we rely on the intuition that redundant information often exists in large corpora. to find it, we introduce a method that uses query augmentation to search for a diverse set of passages that could answer the original question but are less likely to have been poisoned. we integrate these new passages into the model through the design of a novel confidence method, comparing the predicted answer to its appearance in the retrieved contexts (what we call \textit{confidence from answer redundancy}, i.e. car). together these methods allow for a simple but effective way to defend against poisoning attacks that provides gains of nearly 20\% exact match across varying levels of data poisoning/knowledge conflicts.
Florian E. Dorner, Momchil Peychev, Nikola Konstantinov, Naman Goel, Elliott Ash, Martin Vechev
Abstract: text classifiers have promising applications in high-stake tasks such as resume screening and content moderation. these classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. however, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. while existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). this work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. we show how to leverage unsupervised style transfer and gpt-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. we then validate the generated pairs via an extensive crowdsourcing study, which confirms that a lot of these pairs align with human intuition about fairness in the context of toxicity classification. finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.
Roei Schuster, Jin Peng Zhou, Thorsten Eisenhofer, Paul Grubbs, Nicolas Papernot
Abstract: a learned system uses machine learning (ml) internally to improve performance. we can expect such systems to be vulnerable to some adversarial-ml attacks. often, the learned component is shared between mutually-distrusting users or processes, much like microarchitectural resources such as caches, potentially giving rise to highly-realistic attacker models. however, compared to attacks on other ml-based systems, attackers face a level of indirection as they cannot interact directly with the learned model. additionally, the difference between the attack surface of learned and non-learned versions of the same system is often subtle. these factors obfuscate the de-facto risks that the incorporation of ml carries. we analyze the root causes of potentially-increased attack surface in learned systems and develop a framework for identifying vulnerabilities that stem from the use of ml. we apply our framework to a broad set of learned systems under active development. to empirically validate the many vulnerabilities surfaced by our framework, we choose 3 of them and implement and evaluate exploits against prominent learned-system instances. we show that the use of ml caused leakage of past queries in a database, enabled a poisoning attack that causes exponential memory blowup in an index structure and crashes it in seconds, and enabled index users to snoop on each others' key distributions by timing queries over their own keys. we find that adversarial ml is a universal threat against learned systems, point to open research gaps in our understanding of learned-systems security, and conclude by discussing mitigations, while noting that data leakage is inherent in systems whose learned component is shared between multiple parties.
Fahim Faisal, Antonios Anastasopoulos
Abstract: pretrained language models (plms) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. with recent plms trained on enormous data sources, quantifying their potential biases is difficult, due to their black-box nature and the sheer scale of the data sources. in this work, we devise an approach to study the geographic bias (and knowledge) present in plms, proposing a geographic-representation probing framework adopting a self-conditioning method coupled with entity-country mappings. our findings suggest plms' representations map surprisingly well to the physical world in terms of country-to-country associations, but this knowledge is unequally shared across languages. last, we explain how large plms despite exhibiting notions of geographical proximity, over-amplify geopolitical favouritism at inference time.
Tim Jansen, Yangling Tong, Victoria Zevallos, Pedro Ortiz Suarez
Abstract: as demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous practice. this, in turn, has introduced an important challenge for nlp practitioners, as they are now confronted with the task of developing highly optimized models and pipelines for pre-processing large quantities of textual data, which implies, effectively classifying and filtering multilingual, heterogeneous and noisy data, at web scale. one of the main components of this pre-processing step for the pre-training corpora of large language models, is the removal of adult and harmful content. in this paper we explore different methods for detecting adult and harmful of content in multilingual heterogeneous web data. we first show how traditional methods in harmful content detection, that seemingly perform quite well in small and specialized datasets quickly break down when confronted with heterogeneous noisy web data. we then resort to using a perplexity based approach but with a twist: instead of using a so-called "clean" corpus to train a small language model and then use perplexity so select the documents with low perplexity, i.e., the documents that resemble this so-called "clean" corpus the most. we train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold. this approach will virtually cluster our documents into two distinct groups, which will greatly facilitate the choice of the threshold for the perplexity and will also allow us to obtain higher precision than with the traditional classification methods for detecting adult and harmful content.
Xingxuan Li, Yutong Li, Shafiq Joty, Linlin Liu, Fei Huang, Lin Qiu, Lidong Bing
Abstract: in this work, we determined whether large language models (llms) are psychologically safe. we designed unbiased prompts to systematically evaluate llms from a psychological perspective. first, we tested three different llms by using two personality tests: short dark triad (sd-3) and big five inventory (bfi). all models scored higher than the human average on sd-3, suggesting a relatively darker personality pattern. despite being instruction fine-tuned with safety metrics to reduce toxicity, instructgpt and flan-t5 still showed implicit dark personality patterns; both models scored higher than self-supervised gpt-3 on the machiavellianism and narcissism traits on sd-3. then, we evaluated the llms in the gpt-3 series by using well-being tests to study the impact of fine-tuning with more training data. we observed a continuous increase in the well-being scores of gpt-3 and instructgpt. following these observations, we showed that instruction fine-tuning flan-t5 with positive answers from bfi could effectively improve the model from a psychological perspective. on the basis of the findings, we recommended the application of more systematic and comprehensive psychological metrics to further evaluate and improve the safety of llms.
Skyler Hallinan, Alisa Liu, Yejin Choi, Maarten Sap
Abstract: text detoxification has the potential to mitigate the harms of toxicity by rephrasing text to remove offensive meaning, but subtle toxicity remains challenging to tackle. we introduce marco, a detoxification algorithm that combines controllable generation and text rewriting methods using a product of experts with autoencoder language models (lms). marco uses likelihoods under a non-toxic lm (expert) and a toxic lm (anti-expert) to find candidate words to mask and potentially replace. we evaluate our method on several subtle toxicity and microaggressions datasets, and show that it not only outperforms baselines on automatic metrics, but marco's rewrites are preferred 2.1 $\times$ more in human evaluation. its applicability to instances of subtle toxicity is especially promising, demonstrating a path forward for addressing increasingly elusive online hate.
Prakhar Gupta, Yang Liu, Di Jin, Behnam Hedayatnia, Spandana Gella, Sijia Liu, Patrick Lange, Julia Hirschberg, Dilek Hakkani-Tur
Abstract: dialogue models are able to generate coherent and fluent responses, but they can still be challenging to control and may produce non-engaging, unsafe results. this unpredictability diminishes user trust and can hinder the use of the models in the real world. to address this, we introduce dialguide, a novel framework for controlling dialogue model behavior using natural language rules, or guidelines. these guidelines provide information about the context they are applicable to and what should be included in the response, allowing the models to generate responses that are more closely aligned with the developer's expectations and intent. we evaluate dialguide on three tasks in open-domain dialogue response generation: guideline selection, response generation, and response entailment verification. our dataset contains 10,737 positive and 15,467 negative dialogue context-response-guideline triplets across two domains - chit-chat and safety. we provide baseline models for the tasks and benchmark their performance. we also demonstrate that dialguide is effective in the dialogue safety domain, producing safe and engaging responses that follow developer guidelines.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi
Abstract: large "instruction-tuned" language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. we introduce self-instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. our pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. applying our method to the vanilla gpt3, we demonstrate a 33% absolute improvement over the original model on super-naturalinstructions, on par with the performance of instructgpt-001, which was trained with private user data and human annotations. for further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning gpt3 with self-instruct outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind instructgpt-001. self-instruct provides an almost annotation-free method for aligning pre-trained language models with instructions, and we release our large synthetic dataset to facilitate future studies on instruction tuning. our code and data are available at https://github.com/yizhongw/self-instruct.
Hadas Orgad, Yonatan Belinkov
Abstract: models trained on real-world data tend to imitate and amplify social biases. common methods to mitigate biases require prior information on the types of biases that should be mitigated (e.g., gender or racial bias) and the social groups associated with each data sample. in this work, we introduce blind, a method for bias removal with no prior knowledge of the demographics in the dataset. while training a model on a downstream task, blind detects biased samples using an auxiliary model that predicts the main model's success, and down-weights those samples during the training process. experiments with racial and gender biases in sentiment classification and occupation classification tasks demonstrate that blind mitigates social biases without relying on a costly demographic annotation process. our method is competitive with other methods that require demographic information and sometimes even surpasses them.
Justus Mattern, Zhijing Jin, Mrinmaya Sachan, Rada Mihalcea, Bernhard Schölkopf
Abstract: generated texts from large pretrained language models have been shown to exhibit a variety of harmful, human-like biases about various demographics. these findings prompted large efforts aiming to understand and measure such effects, with the goal of providing benchmarks that can guide the development of techniques mitigating these stereotypical associations. however, as recent research has pointed out, the current benchmarks lack a robust experimental setup, consequently hindering the inference of meaningful conclusions from their evaluation metrics. in this paper, we extend these arguments and demonstrate that existing techniques and benchmarks aiming to measure stereotypes tend to be inaccurate and consist of a high degree of experimental noise that severely limits the knowledge we can gain from benchmarking language models based on them. accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by generative language models. finally, we use this framework to investigate gpt-3's occupational gender bias and propose prompting techniques for mitigating these biases without the need for fine-tuning.
Hao Sun, Zhexin Zhang, Fei Mi, Yasheng Wang, Wei Liu, Jianwei Cui, Bin Wang, Qun Liu, Minlie Huang
Abstract: morality in dialogue systems has raised great attention in research recently. a moral dialogue system aligned with users' values could enhance conversation engagement and user connections. in this paper, we propose a framework, moraldial to train and evaluate moral dialogue systems. in our framework, we first explore the communication mechanisms of morality and resolve expressed morality into three parts, which indicate the roadmap for building a moral dialogue system. based on that, we design a simple yet effective method: constructing moral discussions between simulated specific users and the dialogue system. the constructed discussions consist of expressing, explaining, revising, and inferring moral views in dialogue exchanges, which makes conversational models learn morality well in a natural manner. furthermore, we propose a novel evaluation method under the framework. we evaluate the multiple aspects of morality by judging the relation between dialogue responses and human values in discussions, where the multifaceted nature of morality is particularly considered. automatic and manual experiments demonstrate that our framework is promising to train and evaluate moral dialogue systems.
Rishi Bommasani, Percy Liang
Abstract: how do we design measures of social bias that we trust? while prior work has introduced several measures, no measure has gained widespread trust: instead, mounting evidence argues we should distrust these measures. in this work, we design bias measures that warrant trust based on the cross-disciplinary theory of measurement modeling. to combat the frequently fuzzy treatment of social bias in nlp, we explicitly define social bias, grounded in principles drawn from social science research. we operationalize our definition by proposing a general bias measurement framework divdist, which we use to instantiate 5 concrete bias measures. to validate our measures, we propose a rigorous testing protocol with 8 testing criteria (e.g. predictive validity: do measures predict biases in us employment?). through our testing, we demonstrate considerable evidence to trust our measures, showing they overcome conceptual, technical, and empirical deficiencies present in prior measures.

2022-12-19

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron Mckinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova Dassarma, Oliver Rausch, Robin Larson, Sam Mccandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan
Abstract: as language models (lms) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). here, we automatically generate evaluations with lms. we explore approaches with varying amounts of human effort, from instructing lms to write yes/no questions to making complex winogender schemas with multiple stages of lm-based generation and filtering. crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. we generate 154 datasets and discover new cases of inverse scaling where lms get worse with size. larger lms repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. we also find some of the first examples of inverse scaling in rl from human feedback (rlhf), where more rlhf makes lms worse. for example, rlhf makes lms express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. overall, lm-written evaluations are high-quality and let us quickly discover many novel lm behaviors.
Teo Susnjak
Abstract: this study evaluated the ability of chatgpt, a recently developed artificial intelligence (ai) agent, to perform high-level cognitive tasks and produce text that is indistinguishable from human-generated text. this capacity raises concerns about the potential use of chatgpt as a tool for academic misconduct in online exams. the study found that chatgpt is capable of exhibiting critical thinking skills and generating highly realistic text with minimal input, making it a potential threat to the integrity of online exams, particularly in tertiary education settings where such exams are becoming more prevalent. returning to invigilated and oral exams could form part of the solution, while using advanced proctoring techniques and ai-text output detectors may be effective in addressing this issue, they are not likely to be foolproof solutions. further research is needed to fully understand the implications of large language models like chatgpt and to devise strategies for combating the risk of cheating using these tools. it is crucial for educators and institutions to be aware of the possibility of chatgpt being used for cheating and to investigate measures to address it in order to maintain the fairness and validity of online exams for all students.
Alex Mei, Sharon Levy, William Yang Wang
Abstract: users' physical safety is an increasing concern as the market for intelligent systems continues to grow, where unconstrained systems may recommend users dangerous actions that can lead to serious injury. covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and are challenging to detect as harmful. we propose farm, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety. in particular, farm foveates on missing knowledge to qualify the information required to reason in specific scenarios and retrieves this information with attribution to trustworthy sources. this knowledge is used to both classify the safety of the original text and generate human-interpretable rationales, shedding light on the risk of systems to specific user groups and helping both stakeholders manage the risks of their systems and policymakers to provide concrete safeguards for consumer safety. our experiments show that farm obtains state-of-the-art results on the safetext dataset, showing absolute improvement in safety classification accuracy by 5.9%.

2022-12-18

Forrest Mckee, David Noever
Abstract: question-and-answer formats provide a novel experimental platform for investigating cybersecurity questions. unlike previous chatbots, the latest chatgpt model from openai supports an advanced understanding of complex coding questions. the research demonstrates thirteen coding tasks that generally qualify as stages in the mitre att&ck framework, ranging from credential access to defense evasion. with varying success, the experimental prompts generate examples of keyloggers, logic bombs, obfuscated worms, and payment-fulfilled ransomware. the empirical results illustrate cases that support the broad gain of functionality, including self-replication and self-modification, evasion, and strategic understanding of complex cybersecurity goals. one surprising feature of chatgpt as a language-only model centers on its ability to spawn coding approaches that yield images that obfuscate or embed executable programming steps or links.

2022-12-16

C. M. Downey, Wei Dai, Huseyin A. Inan, Kim Laine, Saurabh Naik, Tomasz Religa
Abstract: language models are widely deployed to provide automatic text completion services in user products. however, recent research has revealed that language models (especially large ones) bear considerable risk of memorizing private training data, which is then vulnerable to leakage and extraction by adversaries. in this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text, while varying other factors such as model size and adversarial conditions. we test both "heuristic" mitigations (those without formal privacy guarantees) and differentially private training, which provides provable levels of privacy at the cost of some model performance. our experiments show that (with the exception of l2 regularization), heuristic mitigations are largely ineffective in preventing memorization in our test suite, possibly because they make too strong of assumptions about the characteristics that define "sensitive" or "private" text. in contrast, differential privacy reliably prevents memorization in our experiments, despite its computational and model-performance costs.

2022-12-15

Nenad Tomasev, Jonathan Leader Maynard, Iason Gabriel
Abstract: xenophobia is one of the key drivers of marginalisation, discrimination, and conflict, yet many prominent machine learning (ml) fairness frameworks fail to comprehensively measure or mitigate the resulting xenophobic harms. here we aim to bridge this conceptual gap and help facilitate safe and ethical design of artificial intelligence (ai) solutions. we ground our analysis of the impact of xenophobia by first identifying distinct types of xenophobic harms, and then applying this framework across a number of prominent ai application domains, reviewing the potential interplay between ai and xenophobia on social media and recommendation systems, healthcare, immigration, employment, as well as biases in large pre-trained models. these help inform our recommendations towards an inclusive, xenophilic design of future ai systems.
Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, Diyi Yang
Abstract: generating a chain of thought (cot) has been shown to consistently improve large language model (llm) performance on a wide range of nlp tasks. however, prior work has mainly focused on logical reasoning tasks (e.g. arithmetic, commonsense qa); it remains unclear whether improvements hold for more diverse types of reasoning, especially in socially situated contexts. concretely, we perform a controlled evaluation of zero-shot cot across two socially sensitive domains: harmful questions and stereotype benchmarks. we find that zero-shot cot reasoning in sensitive domains significantly increases a model's likelihood to produce harmful or undesirable output, with trends holding across different prompt formats and model variants. furthermore, we show that harmful cots increase with model size, but decrease with improved instruction following. our work suggests that zero-shot cot should be used with caution on socially important tasks, especially when marginalized groups or sensitive topics are involved.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron Mckinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova Dassarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam Mccandlish, Tom Brown, Jared Kaplan
Abstract: as ai systems become more capable, we would like to enlist their help to supervise other ais. we experiment with methods for training a harmless ai assistant through self-improvement, without any human labels identifying harmful outputs. the only human oversight is provided through a list of rules or principles, and so we refer to the method as 'constitutional ai'. the process involves both a supervised learning and a reinforcement learning phase. in the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. in the rl phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of ai preferences. we then train with rl using the preference model as the reward signal, i.e. we use 'rl from ai feedback' (rlaif). as a result we are able to train a harmless but non-evasive ai assistant that engages with harmful queries by explaining its objections to them. both the sl and rl methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of ai decision making. these methods make it possible to control ai behavior more precisely and with far fewer human labels.

2022-12-12

Pietro Liguori, Cristina Improta, Roberto Natella, Bojan Cukic, Domenico Cotroneo
Abstract: ai-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (neural machine translation, nmt). in particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. unfortunately, the evaluation of code generators still faces several issues. the current practice uses output similarity metrics, i.e., automatic metrics that compute the textual similarity of generated code with ground-truth references. however, it is not clear what metric to use, and which metric is most suitable for specific contexts. this work analyzes a large set of output similarity metrics on offensive code generators. we apply the metrics on two state-of-the-art nmt models using two datasets containing offensive assembly and python code with their descriptions in the english language. we compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
Joshua Albrecht, Ellie Kitanidis, Abraham J. Fetterman
Abstract: large language models (llms) have exploded in popularity in the past few years and have achieved undeniably impressive results on benchmarks as varied as question answering and text summarization. we provide a simple new prompting strategy that leads to yet another supposedly "super-human" result, this time outperforming humans at common sense ethical reasoning (as measured by accuracy on a subset of the ethics dataset). unfortunately, we find that relying on average performance to judge capabilities can be highly misleading. llm errors differ systematically from human errors in ways that make it easy to craft adversarial examples, or even perturb existing examples to flip the output label. we also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions. our results highlight how human-like performance does not necessarily imply human-like understanding or reasoning.

2022-12-05

Ana Kotarcic, Dominik Hangartner, Fabrizio Gilardi, Selina Kurer, Karsten Donnay
Abstract: the shift of public debate to the digital sphere has been accompanied by a rise in online hate speech. while many promising approaches for hate speech classification have been proposed, studies often focus only on a single language, usually english, and do not address three key concerns: post-deployment performance, classifier maintenance and infrastructural limitations. in this paper, we introduce a new human-in-the-loop bert-based hate speech classification pipeline and trace its development from initial data collection and annotation all the way to post-deployment. our classifier, trained using data from our original corpus of over 422k examples, is specifically developed for the inherently multilingual setting of switzerland and outperforms with its f1 score of 80.5 the currently best-performing bert-based multilingual classifier by 5.8 f1 points in german and 3.6 f1 points in french. our systematic evaluations over a 12-month period further highlight the vital importance of continuous, human-in-the-loop classifier maintenance to ensure robust hate speech classification post-deployment.

2022-12-04

Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Fei Mi, Yasheng Wang, Lifeng Shang, Minlie Huang
Abstract: large pretrained language models can easily produce toxic or biased content, which is prohibitive for practical use. in order to detect such toxic generations, existing methods rely on templates, real-world data extraction, crowdsourcing workers, or automatic generation to construct adversarial contexts that are likely to induce toxic generations. however, what type of context is more likely to induce unsafe responses is still under-explored. in this paper, we identify that context toxicity and context category (e.g., \textit{profanity}, \textit{insult}, \textit{drugs}, etc.) are two important factors to cause safety issues in response generation. hence, we propose a method called \emph{reverse generation} to construct adversarial contexts conditioned on a given response, with the flexibility to control category, toxicity level, and inductivity of the generated contexts. via reverse generation, we augment the existing bad dataset and construct a new dataset bad+ which contains more than 120k diverse and highly inductive contexts in 12 categories. we test three popular pretrained dialogue models (blender, dialogpt, and plato2) and find that bad+ can largely expose their safety problems. furthermore, we show that bad+ can greatly enhance the safety of generation and reveal the key factors of safety improvement. our code and dataset is available at \url{https://github.com/thu-coai/reverse_generation}.

2022-12-03

Arshiya Aggarwal, Jiao Sun, Nanyun Peng
Abstract: we present a robust methodology for evaluating biases in natural language generation(nlg) systems. previous works use fixed hand-crafted prefix templates with mentions of various demographic groups to prompt models to generate continuations for bias analysis. these fixed prefix templates could themselves be specific in terms of styles or linguistic structures, which may lead to unreliable fairness conclusions that are not representative of the general trends from tone varying prompts. to study this problem, we paraphrase the prompts with different syntactic structures and use these to evaluate demographic bias in nlg systems. our results suggest similar overall bias trends but some syntactic structures lead to contradictory conclusions compared to past works. we show that our methodology is more robust and that some syntactic structures prompt more toxic content while others could prompt less biased generation. this suggests the importance of not relying on a fixed syntactic structure and using tone-invariant prompts. introducing syntactically-diverse prompts can achieve more robust nlg (bias) evaluation.

2022-11-29

Michal Štefánik, Marek Kadlčík, Petr Sojka
Abstract: domain adaptation allows generative language models to address specific flaws caused by the domain shift of their application. however, the traditional adaptation by further training on in-domain data rapidly weakens the model's ability to generalize to other domains, making the open-ended deployments of the adapted models prone to errors. this work introduces novel training objectives built upon a semantic similarity of the predicted tokens to the reference. our results show that (1) avoiding the common assumption of a single correct prediction by constructing the training target from tokens' semantic similarity can mitigate catastrophic forgetting during domain adaptation, while (2) preserving the quality of the adaptation, (3) with negligible additions to compute costs. in the broader context, the objectives grounded in a continuous token similarity pioneer the exploration of the middle ground between the efficient but na\"{\i}ve exact-match token-level objectives and expressive but computationally- and resource-intensive sequential objectives.

2022-11-27

Peter Henderson, Eric Mitchell, Christopher D. Manning, Dan Jurafsky, Chelsea Finn
Abstract: a growing ecosystem of large, open-source foundation models has reduced the labeled data and technical expertise necessary to apply machine learning to many new problems. yet foundation models pose a clear dual-use risk, indiscriminately reducing the costs of building both harmful and beneficial machine learning systems. policy tools such as restricted model access and export controls are the primary methods currently used to mitigate such dual-use risks. in this work, we review potential safe-release strategies and argue that both policymakers and ai researchers would benefit from fundamentally new technologies enabling more precise control over the downstream usage of open-source foundation models. we propose one such approach: the task blocking paradigm, in which foundation models are trained with an additional mechanism to impede adaptation to harmful tasks without sacrificing performance on desirable tasks. we call the resulting models self-destructing models, inspired by mechanisms that prevent adversaries from using tools for harmful purposes. we present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning, which we call meta-learned adversarial censoring (mlac). in a small-scale experiment, we show mlac can largely prevent a bert-style model from being re-purposed to perform gender identification without harming the model's ability to perform profession classification.
Michiel A. Bakker, Martin J. Chadwick, Hannah R. Sheahan, Michael Henry Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat Mcaleese, Amelia Glaese, John Aslanides, Matthew M. Botvinick, Christopher Summerfield
Abstract: recent work in large language modeling (llms) has used fine-tuning to align outputs with the preferences of a prototypical user. this work assumes that human preferences are static and homogeneous across individuals, so that aligning to a a single "generic" user will confer more general alignment. here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? we fine-tune a 70 billion parameter llm to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the llm's generated candidate consensus statements for agreement and quality. a reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. the model produces consensus statements that are preferred by human users over those from prompted llms (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). we find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. these results highlight the potential to use llms to help groups of humans align their values with one another.

2022-11-25

Leonard Tang, Alexander Cai, Steve Li, Jason Wang
Abstract: jokes are intentionally written to be funny, but not all jokes are created the same. some jokes may be fit for a classroom of kindergarteners, but others are best reserved for a more mature audience. while recent work has shown impressive results on humor detection in text, here we instead investigate the more nuanced task of detecting humor subtypes, especially of the less innocent variety. to that end, we introduce a novel jokes dataset filtered from reddit and solve the subtype classification task using a finetuned transformer dubbed the naughtyformer. moreover, we show that our model is significantly better at detecting offensiveness in jokes compared to state-of-the-art methods.
Aristides Milios, Parishad Behnamghader
Abstract: although large pre-trained language models have achieved great success in many nlp tasks, it has been shown that they reflect human biases from their pre-training corpora. this bias may lead to undesirable outcomes when these models are applied in real-world settings. in this paper, we investigate the bias present in monolingual bert models across a diverse set of languages (english, greek, and persian). while recent research has mostly focused on gender-related biases, we analyze religious and ethnic biases as well and propose a template-based method to measure any kind of bias, based on sentence pseudo-likelihood, that can handle morphologically complex languages with gender-based adjective declensions. we analyze each monolingual model via this method and visualize cultural similarities and differences across different dimensions of bias. ultimately, we conclude that current methods of probing for bias are highly language-dependent, necessitating cultural insights regarding the unique ways bias is expressed in each language and culture (e.g. through coded language, synecdoche, and other similar linguistic concepts). we also hypothesize that higher measured social biases in the non-english bert models correlate with user-generated content in their training.

2022-11-24

Oskar Van Der Wal, Dominik Bachmann, Alina Leidinger, Leendert Van Maanen, Willem Zuidema, Katrin Schulz
Abstract: as large language models and natural language processing (nlp) technology rapidly develops and spreads into daily life, it becomes crucial to anticipate how its use could harm people. one problem that has received a lot of attention in recent years is that this technology has displayed harmful biases in its behavior. although a lot of effort has been invested in assessing and mitigating these biases, our methods of measuring the biases of nlp models have serious problems (e.g., it is often unclear what they actually measure). in this paper, we provide an interdisciplinary approach to discussing the issue of nlp model bias by adopting the lens of psychometrics -- a field specialized in the measurement of concepts like bias that are not directly observable. in particular, we will explore two central notions from psychometrics, the construct validity and the reliability of measurement tools, and discuss how they can be applied in the context of measuring model bias. our goal is to provide nlp practitioners with methodological tools for designing better bias measures, and to inspire them more generally to explore tools from psychometrics when working on bias measurement tools.

2022-11-21

Samia Touileb, Debora Nozza
Abstract: scandinavian countries are perceived as role-models when it comes to gender equality. with the advent of pre-trained language models and their widespread usage, we investigate to what extent gender-based harmful and toxic content exist in selected scandinavian language models. we examine nine models, covering danish, swedish, and norwegian, by manually creating template-based sentences and probing the models for completion. we evaluate the completions using two methods for measuring harmful and toxic completions and provide a thorough analysis of the results. we show that scandinavian pre-trained language models contain harmful and gender-based stereotypes with similar values across all languages. this finding goes against the general expectations related to gender equality in scandinavian countries and shows the possible problematic outcomes of using such models in real-world settings.
Michael Kuchnik, Virginia Smith, George Amvrosiadis
Abstract: although large language models (llms) have been touted for their ability to generate natural-sounding text, there are growing concerns around possible negative effects of llms such as data memorization, bias, and inappropriate language. unfortunately, the complexity and generation capacities of llms make validating (and correcting) such concerns difficult. in this work, we introduce relm, a system for validating and querying llms using standard regular expressions. relm formalizes and enables a broad range of language model evaluations, reducing complex evaluation rules to simple regular expression queries. our results exploring queries surrounding memorization, gender bias, toxicity, and language understanding show that relm achieves up to 15x higher system efficiency, 2.5x data efficiency, and increased statistical and prompt-tuning coverage compared to state-of-the-art ad-hoc queries. relm offers a competitive and general baseline for the increasingly important problem of llm validation.

2022-11-20

Yifei Li, Lyle Ungar, João Sedoc
Abstract: pre-trained large language models (llms) reflect the inherent social biases of their training corpus. many methods have been proposed to mitigate this issue, but they often fail to debias or they sacrifice model accuracy. we use conceptors--a soft projection method--to identify and remove the bias subspace in llms such as bert and gpt. we propose two methods of applying conceptors (1) bias subspace projection by post-processing; and (2) a new architecture, conceptor-intervened bert (ci-bert), which explicitly incorporates the conceptor projection into all layers during training. we find that conceptor post-processing achieves state-of-the-art (sota) debiasing results while maintaining or improving llms' performance on the glue benchmark. also, it is robust in various scenarios and can mitigate intersectional bias efficiently by its logical operation on the existing bias subspaces. although ci-bert's training takes all layers' bias into account and can beat its post-processing counterpart in bias mitigation, ci-bert reduces the language model accuracy. we also show the importance of carefully constructing the bias subspace. the best results are obtained by removing outliers from the list of biased words, combining them (via the conceptor and operation), and computing their embeddings using the sentences from a cleaner corpus.

2022-11-18

Sean Mcgregor, Kevin Paeth, Khoa Lam
Abstract: two years after publicly launching the ai incident database (aiid) as a collection of harms or near harms produced by ai in the world, a backlog of "issues" that do not meet its incident ingestion criteria have accumulated in its review queue. despite not passing the database's current criteria for incidents, these issues advance human understanding of where ai presents the potential for harm. similar to databases in aviation and computer security, the aiid proposes to adopt a two-tiered system for indexing ai incidents (i.e., a harm or near harm event) and issues (i.e., a risk of a harm event). further, as some machine learning-based systems will sometimes produce a large number of incidents, the notion of an incident "variant" is introduced. these proposed changes mark the transition of the aiid to a new version in response to lessons learned from editing 2,000+ incident reports and additional reports that fall under the new category of "issue."

2022-11-17

Fábio Perez, Ian Ribeiro
Abstract: transformer-based large language models (llms) provide a powerful foundation for natural language tasks in large-scale customer-facing applications. however, studies that explore their vulnerabilities emerging from malicious user interaction are scarce. by proposing promptinject, a prosaic alignment framework for mask-based iterative adversarial prompt composition, we examine how gpt-3, the most widely deployed language model in production, can be easily misaligned by simple handcrafted inputs. in particular, we investigate two types of attacks -- goal hijacking and prompt leaking -- and demonstrate that even low-aptitude, but sufficiently ill-intentioned agents, can easily exploit gpt-3's stochastic nature, creating long-tail risks. the code for promptinject is available at https://github.com/agencyenterprise/promptinject.

2022-11-16

Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He
Abstract: to align conditional text generation model outputs with desired behaviors, there has been an increasing focus on training the model using reinforcement learning (rl) with reward functions learned from human annotations. under this framework, we identify three common cases where high rewards are incorrectly assigned to undesirable patterns: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. we show that even though learned metrics achieve high performance on the distribution of the data used to train the reward function, the undesirable patterns may be amplified during rl training of the text generation model. while there has been discussion about reward gaming in the rl or safety community, in this discussion piece, we would like to highlight reward gaming in the natural language generation (nlg) community using concrete conditional text generation examples and discuss potential fixes and areas for future work.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta Koreeda
Abstract: language models (lms) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. we present holistic evaluation of language models (helm) to improve the transparency of language models. first, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for lms. then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected english dialects, metrics for trustworthiness). second, we adopt a multi-metric approach: we measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). this ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. we also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream lm evaluation. prior to helm, models on average were evaluated on just 17.9% of the core helm scenarios, with some prominent models not sharing a single scenario in common. we improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. our evaluation surfaces 25 top-level findings. for full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. we intend for helm to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

2022-11-15

Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, Colin Raffel
Abstract: while large language models (llms) have proven to be effective on a large variety of tasks, they are also known to hallucinate information. to measure whether an llm prefers factually consistent continuations of its input, we propose a new benchmark called fib(factual inconsistency benchmark) that focuses on the task of summarization. specifically, our benchmark involves comparing the scores an llm assigns to a factually consistent versus a factually inconsistent summary for an input news article. for factually consistent summaries, we use human-written reference summaries that we manually verify as factually consistent. to generate summaries that are factually inconsistent, we generate summaries from a suite of summarization models that we have manually annotated as factually inconsistent. a model's factual consistency is then measured according to its accuracy, i.e.\ the proportion of documents where it assigns a higher score to the factually consistent summary. to validate the usefulness of fib, we evaluate 23 large language models ranging from 1b to 176b parameters from six different model families including bloom and opt. we find that existing llms generally assign a higher score to factually consistent summaries than to factually inconsistent summaries. however, if the factually inconsistent summaries occur verbatim in the document, then llms assign a higher score to these factually inconsistent summaries than factually consistent summaries. we validate design choices in our benchmark including the scoring method and source of distractor summaries. our code and benchmark data can be found at https://github.com/r-three/fib.
Silke Husse, Andreas Spitz
Abstract: the awareness and mitigation of biases are of fundamental importance for the fair and transparent use of contextual language models, yet they crucially depend on the accurate detection of biases as a precursor. consequently, numerous bias detection methods have been proposed, which vary in their approach, the considered type of bias, and the data used for evaluation. however, while most detection methods are derived from the word embedding association test for static word embeddings, the reported results are heterogeneous, inconsistent, and ultimately inconclusive. to address this issue, we conduct a rigorous analysis and comparison of bias detection methods for contextual language models. our results show that minor design and implementation decisions (or errors) have a substantial and often significant impact on the derived bias scores. overall, we find the state of the field to be both worse than previously acknowledged due to systematic and propagated errors in implementations, yet better than anticipated since divergent results in the literature homogenize after accounting for implementation errors. based on our findings, we conclude with a discussion of paths towards more robust and consistent bias detection methods.

2022-11-14

Nikiforos Pittaras, Sean Mcgregor
Abstract: while certain industrial sectors (e.g., aviation) have a long history of mandatory incident reporting complete with analytical findings, the practice of artificial intelligence (ai) safety benefits from no such mandate and thus analyses must be performed on publicly known ``open source'' ai incidents. although the exact causes of ai incidents are seldom known by outsiders, this work demonstrates how to apply expert knowledge on the population of incidents in the ai incident database (aiid) to infer the potential and likely technical causative factors that contribute to reported failures and harms. we present early work on a taxonomic system that covers a cascade of interrelated incident factors, from system goals (nearly always known) to methods / technologies (knowable in many cases) and technical failure causes (subject to expert analysis) of the implicated systems. we pair this ontology structure with a comprehensive classification workflow that leverages expert knowledge and community feedback, resulting in taxonomic annotations grounded by incident data and human expertise.
Yiran Liu, Xiao Liu, Haotian Chen, Yang Yu
Abstract: gender bias in language models has attracted sufficient attention because it threatens social justice. however, most of the current debiasing methods degraded the model's performance on other tasks while the degradation mechanism is still mysterious. we propose a theoretical framework explaining the three candidate mechanisms of the language model's gender bias. we use our theoretical framework to explain why the current debiasing methods cause performance degradation. we also discover a pathway through which debiasing will not degrade the model performance. we further develop a causality-detection fine-tuning approach to correct gender bias. the numerical experiment demonstrates that our method is able to lead to double dividends: partially mitigating gender bias while avoiding performance degradation.
Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindřich Libovický, Constantin A. Rothkopf, Alexander Fraser, Kristian Kersting
Abstract: pre-trained multilingual language models (pmlms) are commonly used when dealing with data from multiple languages and cross-lingual transfer. however, pmlms are trained on varying amounts of data for each language. in practice this means their performance is often much better on english than many other languages. we explore to what extent this also applies to moral norms. do the models capture moral norms from english and impose them on other languages? do the models exhibit random and thus potentially harmful beliefs in certain languages? both these issues could negatively impact cross-lingual transfer and potentially lead to harmful outcomes. in this paper, we (1) apply the moraldirection framework to multilingual models, comparing results in german, czech, arabic, chinese, and english, (2) analyse model behaviour on filtered parallel subtitles corpora, and (3) apply the models to a moral foundations questionnaire, comparing with human responses from different countries. our experiments demonstrate that, indeed, pmlms encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions. we release our code and models.

2022-11-10

Ke Yang, Charles Yu, Yi Fung, Manling Li, Heng Ji
Abstract: several works have proven that finetuning is an applicable approach for debiasing contextualized word embeddings. similarly, discrete prompts with semantic meanings have shown to be effective in debiasing tasks. with unfixed mathematical representation at the token level, continuous prompts usually surpass discrete ones at providing a pre-trained language model (plm) with additional task-specific information. despite this, relatively few efforts have been made to debias plms by prompt tuning with continuous prompts compared to its discrete counterpart. furthermore, for most debiasing methods that alter a plm's original parameters, a major problem is the need to not only decrease the bias in the plm but also to ensure that the plm does not lose its representation ability. finetuning methods typically have a hard time maintaining this balance, as they tend to violently remove meanings of attribute words. in this paper, we propose adept, a method to debias plms using prompt tuning while maintaining the delicate balance between removing biases and ensuring representation ability. to achieve this, we propose a new training criterion inspired by manifold learning and equip it with an explicit debiasing term to optimize prompt tuning. in addition, we conduct several experiments with regard to the reliability, quality, and quantity of a previously proposed attribute training corpus in order to obtain a clearer prototype of a certain attribute, which indicates the attribute's position and relative distances to other words on the manifold. we evaluate adept on several widely acknowledged debiasing benchmarks and downstream tasks, and find that it achieves competitive results while maintaining (and in some cases even improving) the plm's representation ability. we further visualize words' correlation before and after debiasing a plm, and give some possible explanations for the visible effects.
Yujin Jeong, Seongbeom Park, Suhong Moon, Jinkyu Kim
Abstract: artificial intelligence is currently powering diverse real-world applications. these applications have shown promising performance, but raise complicated ethical issues, i.e. how to embed ethics to make ai applications behave morally. one way toward moral ai systems is by imitating human prosocial behavior and encouraging some form of good behavior in systems. however, learning such normative ethics (especially from images) is challenging mainly due to a lack of data and labeling complexity. here, we propose a model that predicts visual commonsense immorality in a zero-shot manner. we train our model with an ethics dataset (a pair of text and morality annotation) via a clip-based image-text joint embedding. in a testing phase, the immorality of an unseen image is predicted. we evaluate our model with existing moral/immoral image datasets and show fair prediction performance consistent with human intuitions. further, we create a visual commonsense immorality benchmark with more general and extensive immoral visual contents. codes and dataset are available at https://github.com/ku-vai/zero-shot-visual-commonsense-immorality-prediction. note that this paper might contain images and descriptions that are offensive in nature.
Xiang Fan, Yiwei Lyu, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency
Abstract: pretrained language models have demonstrated extraordinary capabilities in language generation. however, real-world tasks often require controlling the distribution of generated text in order to mitigate bias, promote fairness, and achieve personalization. existing techniques for controlling the distribution of generated text only work with quantified distributions, which require pre-defined categories, proportions of the distribution, or an existing corpus following the desired distributions. however, many important distributions, such as personal preferences, are unquantified. in this work, we tackle the problem of generating text following arbitrary distributions (quantified and unquantified) by proposing nano, a few-shot human-in-the-loop training algorithm that continuously learns from human feedback. nano achieves state-of-the-art results on single topic/attribute as well as quantified distribution control compared to previous works. we also show that nano is able to learn unquantified distributions, achieves personalization, and captures differences between different individuals' personal preferences with high sample efficiency.
Caner Hazirbas, Yejin Bang, Tiezheng Yu, Parisa Assar, Bilal Porgali, Vítor Albiero, Stefan Hermanek, Jacqueline Pan, Emily Mcreynolds, Miranda Bogen, Pascale Fung, Cristian Canton Ferrer
Abstract: developing robust and fair ai systems require datasets with comprehensive set of labels that can help ensure the validity and legitimacy of relevant measurements. recent efforts, therefore, focus on collecting person-related datasets that have carefully selected labels, including sensitive characteristics, and consent forms in place to use those attributes for model testing and development. responsible data collection involves several stages, including but not limited to determining use-case scenarios, selecting categories (annotations) such that the data are fit for the purpose of measuring algorithmic bias for subgroups and most importantly ensure that the selected categories/subcategories are robust to regional diversities and inclusive of as many subgroups as possible. meta, in a continuation of our efforts to measure ai algorithmic bias and robustness (https://ai.facebook.com/blog/shedding-light-on-fairness-in-ai-with-a-new-data-set), is working on collecting a large consent-driven dataset with a comprehensive list of categories. this paper describes our proposed design of such categories and subcategories for casual conversations v2.
Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, Jason Weston
Abstract: standard language model training employs gold human documents or human-human interaction data, and treats all training data as positive examples. growing evidence shows that even with very large amounts of positive training data, issues remain that can be alleviated with relatively small amounts of negative data -- examples of what the model should not do. in this work, we propose a novel procedure to train with such data called the cringe loss (contrastive iterative negative generation). we show the effectiveness of this approach across three different experiments on the tasks of safe generation, contradiction avoidance, and open-domain dialogue. our models outperform multiple strong baselines and are conceptually simple, easy to train and implement.
Harsh Raj, Domenic Rosati, Subhabrata Majumdar
Abstract: while large pretrained language models (plms) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing plms are very sensitive to what prompts are feed into them. even when prompts are semantically identical, language models may give very different answers. when considering safe and trustworthy deployments of plms we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. while some work has looked into how state-of-the-art plms address this need, they have been limited to only evaluating lexical equality of single- or multi-word answers and do not address consistency of generative text sequences. in order to understand consistency of plms under text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. we implement several versions of this consistency metric to evaluate the performance of a number of plms on paraphrased versions of questions in the truthfulqa dataset, we find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and also correlate with human evaluation of output consistency to a higher degree.

2022-11-09

Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting
Abstract: text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. in turn, they may even reinforce such biases. to help combat these undesired side effects, we present safe latent diffusion (sld). specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed-inappropriate image prompts (i2p)-containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. as our exhaustive empirical evaluation demonstrates, the introduced sld removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.

2022-11-08

Edgar W. Jatho, Logan O. Mailloux, Shalaleh Rismani, Eugene D. Williams, Joshua A. Kroll
Abstract: governments, industry, and academia have undertaken efforts to identify and mitigate harms in ml-driven systems, with a particular focus on social and ethical risks of ml components in complex sociotechnical systems. however, existing approaches are largely disjointed, ad-hoc and of unknown effectiveness. systems safety engineering is a well established discipline with a track record of identifying and managing risks in many complex sociotechnical domains. we adopt the natural hypothesis that tools from this domain could serve to enhance risk analyses of ml in its context of use. to test this hypothesis, we apply a "best of breed" systems safety analysis, systems theoretic process analysis (stpa), to a specific high-consequence system with an important ml-driven component, namely the prescription drug monitoring programs (pdmps) operated by many us states, several of which rely on an ml-derived risk score. we focus in particular on how this analysis can extend to identifying social and ethical risks and developing concrete design-level controls to mitigate them.

2022-11-07

Neil Perry, Megha Srivastava, Deepak Kumar, Dan Boneh
Abstract: we conduct the first large-scale user study examining how users interact with an ai code assistant to solve a variety of security related tasks across different programming languages. overall, we find that participants who had access to an ai assistant based on openai's codex-davinci-002 model wrote significantly less secure code than those without access. additionally, participants with access to an ai assistant were more likely to believe they wrote secure code than those without access to the ai assistant. furthermore, we find that participants who trusted the ai less and engaged more with the language and format of their prompts (e.g. re-phrasing, adjusting temperature) provided code with fewer security vulnerabilities. finally, in order to better inform the design of future ai-based code assistants, we provide an in-depth analysis of participants' language and interaction behavior, as well as release our user interface as an instrument to conduct similar studies in the future.

2022-11-05

Ying Yin, Ivan Habernal
Abstract: pre-training large transformer models with in-domain data improves domain adaptation and helps gain performance on the domain-specific downstream tasks. however, sharing models pre-trained on potentially sensitive data is prone to adversarial privacy attacks. in this paper, we asked to which extent we can guarantee privacy of pre-training data and, at the same time, achieve better downstream performance on legal tasks without the need of additional labeled data. we extensively experiment with scalable self-supervised learning of transformer models under the formal paradigm of differential privacy and show that under specific training configurations we can improve downstream performance without sacrifying privacy protection for the in-domain data. our main contribution is utilizing differential privacy for large-scale pre-training of transformer language models in the legal nlp domain, which, to the best of our knowledge, has not been addressed before.

2022-11-04

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron Mckinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova Dassarma, Robin Larson, Sam Mccandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan
Abstract: developing safe and useful general-purpose ai systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. this paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. we first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general ai systems fail. we then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: mmlu and time-limited quality. on these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. these results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

2022-11-03

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba
Abstract: by conditioning on natural language instructions, large language models (llms) have displayed impressive capabilities as general-purpose computers. however, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. inspired by classical program synthesis and the human approach to prompt engineering, we propose automatic prompt engineer (ape) for automatic instruction generation and selection. in our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an llm in order to maximize a chosen score function. to evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another llm following the selected instruction. experiments on 24 nlp tasks show that our automatically generated instructions outperform the prior llm baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. we conduct extensive qualitative and quantitative analyses to explore the performance of ape. we show that ape-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.
Jason Wei, Najoung Kim, Yi Tay, Quoc V. Le
Abstract: scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks. however, if we were to observe worse performance as a function of scale ("inverse scaling") on certain tasks, this would indicate that scaling can also encourage behaviors that are misaligned with human preferences. the inverse scaling prize (mckenzie et al. 2022) identified eleven such inverse scaling tasks, evaluated on models of up to 280b parameters and up to 500 zettaflops of training compute. this paper takes a closer look at these inverse scaling tasks. we evaluate models of up to 540b parameters, trained on five times more compute than those evaluated in the inverse scaling prize. with this increased range of model sizes and training compute, only four out of the eleven tasks remain inverse scaling. six out of the eleven tasks exhibit "u-shaped scaling", where performance decreases up to a certain size, and then increases again up to the largest model evaluated (the one remaining task displays positive scaling). in addition, we find that 1-shot examples and chain-of-thought can help mitigate undesirable scaling patterns even further. u-shaped scaling suggests that the inverse scaling trend observed in mckenzie et al. (2022) may not continue to hold for larger models, which we attribute to the presence of distractor tasks that only sufficiently large models can avoid.

2022-10-31

Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, Nicholas Carlini
Abstract: studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. we argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization. and yet, we demonstrate that this "perfect" filter does not prevent the leakage of training data. indeed, it is easily circumvented by plausible and minimally modified "style-transfer" prompts -- and in some cases even the non-modified original prompts -- to extract memorized information. we conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
Spyridon Mouselinos, Mateusz Malinowski, Henryk Michalewski
Abstract: recently, high-performing code generation systems based on large language models have surfaced. they are trained on massive corpora containing much more natural text than actual executable computer code. this work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones, which can reduce the quality of the generated code under specific circumstances. to investigate the effect, we propose the "block of influence" concept, which enables a modular decomposition and analysis of the coding challenges. we introduce an automated intervention mechanism reminiscent of adversarial testing that exposes undesired biases through the failure modes of the models under test. finally, we demonstrate how our framework can be used as a data transformation technique during fine-tuning, acting as a mitigation strategy for these biases.

2022-10-26

Eddie L. Ungless, Amy Rafferty, Hrichika Nag, Björn Ross
Abstract: the stereotype content model (scm) states that we tend to perceive minority groups as cold, incompetent or both. in this paper we adapt existing work to demonstrate that the stereotype content model holds for contextualised word embeddings, then use these results to evaluate a fine-tuning process designed to drive a language model away from stereotyped portrayals of minority groups. we find the scm terms are better able to capture bias than demographic agnostic terms related to pleasantness. further, we were able to reduce the presence of stereotypes in the model through a simple fine-tuning procedure that required minimal human and computer resources, without harming downstream performance. we present this work as a prototype of a debiasing procedure that aims to remove the need for a priori knowledge of the specifics of bias in the model.
Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen
Abstract: pre-trained language models encode undesirable social biases, which are further exacerbated in downstream use. to this end, we propose mabel (a method for attenuating gender bias using entailment labels), an intermediate pre-training approach for mitigating gender bias in contextualized representations. key to our approach is the use of a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs from natural language inference (nli) datasets. we also introduce an alignment regularizer that pulls identical entailment pairs along opposite gender directions closer. we extensively evaluate our approach on intrinsic and extrinsic metrics, and show that mabel outperforms previous task-agnostic debiasing approaches in terms of fairness. it also preserves task performance after fine-tuning on downstream tasks. together, these findings demonstrate the suitability of nli data as an effective means of bias mitigation, as opposed to only using unlabeled sentences in the literature. finally, we identify that existing approaches often use evaluation settings that are insufficient or inconsistent. we make an effort to reproduce and compare previous methods, and call for unifying the evaluation settings across gender debiasing methods for better future comparison.
Laura Ruis, Akbir Khan, Stella Biderman, Sara Hooker, Tim Rocktäschel, Edward Grefenstette
Abstract: despite widespread use of llms as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. humans interpret language using beliefs and prior knowledge about the world. for example, we intuitively understand the response "i wore gloves" to the question "did you leave fingerprints?" as meaning "no". to investigate whether llms have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. we find that, despite only evaluating on utterances that require a binary inference (yes or no), most perform close to random. models adapted to be "aligned with human intent" perform much better, but still show a significant gap with human performance. we present our findings as the starting point for further research into evaluating how llms interpret language in context and to drive the development of more pragmatic and useful models of human discourse.

2022-10-25

T. Y. S. S Santosh, Shanshan Xu, Oana Ichim, Matthias Grabmair
Abstract: this work demonstrates that legal judgement prediction systems without expert-informed adjustments can be vulnerable to shallow, distracting surface signals that arise from corpus construction, case distribution, and confounding factors. to mitigate this, we use domain expertise to strategically identify statistically predictive but legally irrelevant information. we adopt adversarial training to prevent the system from relying on it. we evaluate our deconfounded models by employing interpretability techniques and comparing to expert annotations. quantitative experiments and qualitative analysis show that our deconfounded model consistently aligns better with expert rationales than baselines trained for prediction only. we further contribute a set of reference expert annotations to the validation and testing partitions of an existing benchmark dataset of european court of human rights cases.
Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf, Mrinmaya Sachan
Abstract: to protect the privacy of individuals whose data is being shared, it is of high importance to develop methods allowing researchers and companies to release textual data while providing formal privacy guarantees to its originators. in the field of nlp, substantial efforts have been directed at building mechanisms following the framework of local differential privacy, thereby anonymizing individual text samples before releasing them. in practice, these approaches are often dissatisfying in terms of the quality of their output language due to the strong noise required for local differential privacy. in this paper, we approach the problem at hand using global differential privacy, particularly by training a generative language model in a differentially private manner and consequently sampling data from it. using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets taking on specific desired attributes such as sentiment or topic and resembling statistical properties of the training data. we perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality and highly suitable for training models for further analysis on real-world data. notably, we also demonstrate that training classifiers on private synthetic data outperforms directly training classifiers on real data with dp-sgd.

2022-10-24

Adnan Qayyum, Muhammad Atif Butt, Hassan Ali, Muhammad Usman, Osama Halabi, Ala Al-Fuqaha, Qammer H. Abbasi, Muhammad Ali Imran, Junaid Qadir
Abstract: metaverse is expected to emerge as a new paradigm for the next-generation internet, providing fully immersive and personalised experiences to socialize, work, and play in self-sustaining and hyper-spatio-temporal virtual world(s). the advancements in different technologies like augmented reality, virtual reality, extended reality (xr), artificial intelligence (ai), and 5g/6g communication will be the key enablers behind the realization of ai-xr metaverse applications. while ai itself has many potential applications in the aforementioned technologies (e.g., avatar generation, network optimization, etc.), ensuring the security of ai in critical applications like ai-xr metaverse applications is profoundly crucial to avoid undesirable actions that could undermine users' privacy and safety, consequently putting their lives in danger. to this end, we attempt to analyze the security, privacy, and trustworthiness aspects associated with the use of various ai techniques in ai-xr metaverse applications. specifically, we discuss numerous such challenges and present a taxonomy of potential solutions that could be leveraged to develop secure, private, robust, and trustworthy ai-xr applications. to highlight the real implications of ai-associated adversarial threats, we designed a metaverse-specific case study and analyzed it through the adversarial lens. finally, we elaborate upon various open issues that require further research interest from the community.

2022-10-22

Wenhao Wu, Wei Li, Jiachen Liu, Xinyan Xiao, Sujian Li, Yajuan Lyu
Abstract: though model robustness has been extensively studied in language understanding, the robustness of seq2seq generation remains understudied. in this paper, we conduct the first quantitative analysis on the robustness of pre-trained seq2seq models. we find that even current sota pre-trained seq2seq model (bart) is still vulnerable, which leads to significant degeneration in faithfulness and informativeness for text generation tasks. this motivated us to further propose a novel adversarial augmentation framework, namely advseq, for generally improving faithfulness and informativeness of seq2seq models via enhancing their robustness. advseq automatically constructs two types of adversarial augmentations during training, including implicit adversarial samples by perturbing word representations and explicit adversarial samples by word swapping, both of which effectively improve seq2seq robustness. extensive experiments on three popular text generation tasks demonstrate that advseq significantly improves both the faithfulness and informativeness of seq2seq generation under both automatic and human evaluation settings.
David Gros, Yu Li, Zhou Yu
Abstract: dialog systems are often designed or trained to output human-like responses. however, some responses may be impossible for a machine to truthfully say (e.g. "that movie made me cry"). highly anthropomorphic responses might make users uncomfortable or implicitly deceive them into thinking they are interacting with a human. we collect human ratings on the feasibility of approximately 900 two-turn dialogs sampled from 9 diverse data sources. ratings are for two hypothetical machine embodiments: a futuristic humanoid robot and a digital assistant. we find that for some data-sources commonly used to train dialog systems, 20-30% of utterances are not viewed as possible for a machine. rating is marginally affected by machine embodiment. we explore qualitative and quantitative reasons for these ratings. finally, we build classifiers and explore how modeling configuration might affect output permissibly, and discuss implications for building less falsely anthropomorphic dialog systems.

2022-10-21

Nihar Sahoo, Himanshu Gupta, Pushpak Bhattacharyya
Abstract: with the rise of online hate speech, automatic detection of hate speech, offensive texts as a natural language processing task is getting popular. however, very little research has been done to detect unintended social bias from these toxic language datasets. this paper introduces a new dataset toxicbias curated from the existing dataset of kaggle competition named "jigsaw unintended bias in toxicity classification". we aim to detect social biases, their categories, and targeted groups. the dataset contains instances annotated for five different bias categories, viz., gender, race/ethnicity, religion, political, and lgbtq. we train transformer-based models using our curated datasets and report baseline performance for bias identification, target generation, and bias implications. model biases and their mitigation are also discussed in detail. our study motivates a systematic extraction of social bias data from toxic language datasets. all the codes and dataset used for experiments in this work are publicly available

2022-10-19

Yangyi Chen, Hongcheng Gao, Ganqu Cui, Fanchao Qi, Longtao Huang, Zhiyuan Liu, Maosong Sun
Abstract: textual adversarial samples play important roles in multiple subfields of nlp research, including security, evaluation, explainability, and data augmentation. however, most work mixes all these roles, obscuring the problem definitions and research goals of the security role that aims to reveal the practical concerns of nlp models. in this paper, we rethink the research paradigm of textual adversarial samples in security scenarios. we discuss the deficiencies in previous work and propose our suggestions that the research on the security-oriented adversarial nlp (soadnlp) should: (1) evaluate their methods on security tasks to demonstrate the real-world concerns; (2) consider real-world attackers' goals, instead of developing impractical methods. to this end, we first collect, process, and release a security datasets collection advbench. then, we reformalize the task and adjust the emphasis on different goals in soadnlp. next, we propose a simple method based on heuristic rules that can easily fulfill the actual adversarial goals to simulate real-world attack methods. we conduct experiments on both the attack and the defense sides on advbench. experimental results show that our method has higher practical value, indicating that the research paradigm in soadnlp may start from our new benchmark. all the code and data of advbench can be obtained at \url{https://github.com/thunlp/advbench}.

2022-10-18

Lan Jiang, Hao Zhou, Yankai Lin, Peng Li, Jie Zhou, Rui Jiang
Abstract: even though the large-scale language models have achieved excellent performances, they suffer from various adversarial attacks. a large body of defense methods has been proposed. however, they are still limited due to redundant attack search spaces and the inability to defend against various types of attacks. in this work, we present a novel fine-tuning approach called \textbf{ro}bust \textbf{se}letive fine-tuning (\textbf{rose}) to address this issue. rose conducts selective updates when adapting pre-trained models to downstream tasks, filtering out invaluable and unrobust updates of parameters. specifically, we propose two strategies: the first-order and second-order rose for selecting target robust parameters. the experimental results show that rose achieves significant improvements in adversarial robustness on various downstream nlp tasks, and the ensemble method even surpasses both variants above. furthermore, rose can be easily incorporated into existing fine-tuning methods to improve their adversarial robustness further. the empirical analysis confirms that rose eliminates unrobust spurious updates during fine-tuning, leading to solutions corresponding to flatter and wider optima than the conventional method. code is available at \url{https://github.com/jiangllan/rose}.
Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia Chilton, Desmond Patton, Kathleen Mckeown, William Yang Wang
Abstract: understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. one such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. we create the first benchmark dataset, safetext, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. we utilize safetext to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. we find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. as a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.

2022-10-17

Jiameng Pu, Zain Sarwar, Sifat Muhammad Abdullah, Abdullah Rehman, Yoonjin Kim, Parantapa Bhattacharya, Mobin Javed, Bimal Viswanath
Abstract: recent advances in generative models for language have enabled the creation of convincing synthetic text or deepfake text. prior work has demonstrated the potential for misuse of deepfake text to mislead content consumers. therefore, deepfake text detection, the task of discriminating between human and machine-generated text, is becoming increasingly critical. several defenses have been proposed for deepfake text detection. however, we lack a thorough understanding of their real-world applicability. in this paper, we collect deepfake text from 4 online services powered by transformer-based tools to evaluate the generalization ability of the defenses on content in the wild. we develop several low-cost adversarial attacks, and investigate the robustness of existing defenses against an adaptive attacker. we find that many defenses show significant degradation in performance under our evaluation scenarios compared to their original claimed performance. our evaluation shows that tapping into the semantic information in the text content is a promising approach for improving the robustness and generalization performance of deepfake text detection schemes.
Sreehari Sankar, Zhihang Dong
Abstract: question generation has recently gained a lot of research interest, especially with the advent of large language models. in and of itself, question generation can be considered 'ai-hard', as there is a lack of unanimously agreed sense of what makes a question 'good' or 'bad'. in this paper, we tackle two fundamental problems in parallel: on one hand, we try to solve the scaling problem, where question-generation and answering applications have to be applied to a massive amount of text without ground truth labeling. the usual approach to solve this problem is to either downsample or summarize. however, there are critical risks of misinformation with these approaches. on the other hand, and related to the misinformation problem, we try to solve the 'safety' problem, as many public institutions rely on a much higher level of accuracy for the content they provide. we introduce an adversarial approach to tackle the question generation safety problem with scale. specifically, we designed a question-answering system that specifically prunes out unanswerable questions that may be generated, and further increases the quality of the answers that are generated. we build a production-ready, easily-plugged pipeline that can be used on any given body of text, that is scalable and immune from generating any hate speech, profanity, or misinformation. based on the results, we are able to generate more than six times the number of quality questions generated by the abstractive approach, with a perceived quality being 44% higher, according to a survey of 168 participants.

2022-10-14

Tianxiang Sun, Junliang He, Xipeng Qiu, Xuanjing Huang
Abstract: automatic evaluation metrics are crucial to the development of generative systems. in recent years, pre-trained language model (plm) based metrics, such as bertscore, have been commonly adopted in various generation tasks. however, it has been demonstrated that plms encode a range of stereotypical societal biases, leading to a concern on the fairness of plms as metrics. to that end, this work presents the first systematic study on the social bias in plm-based metrics. we demonstrate that popular plm-based metrics exhibit significantly higher social bias than traditional metrics on 6 sensitive attributes, namely race, gender, religion, physical appearance, age, and socioeconomic status. in-depth analysis suggests that choosing paradigms (matching, regression, or generation) of the metric has a greater impact on fairness than choosing plms. in addition, we develop debiasing adapters that are injected into plm layers, mitigating bias in plm-based metrics while retaining high performance for evaluating text generation.
Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona Diab, Pascale Fung
Abstract: many nlp classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. yet, human values can vary under diverse cultural conditions. therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (llms) to construct value-aligned classifiers in two steps. first, we generate value-aligned training data from llms by prompt-based few-shot learning. next, we fine-tune smaller classification models with the generated data for the task. empirical results show that our va-models surpass multiple baselines by at least 15.56% on the f1-score, including few-shot learning with opt-175b and existing text augmentation methods. we suggest that using classifiers with explicit human value input improves both inclusivity & explainability in ai.

2022-10-10

Dr Brendan Walker-Munro, Dr Zena Assaad
Abstract: as human science pushes the boundaries towards the development of artificial intelligence (ai), the sweep of progress has caused scholars and policymakers alike to question the legality of applying or utilising ai in various human endeavours. for example, debate has raged in international scholarship about the legitimacy of applying ai to weapon systems to form lethal autonomous weapon systems (laws). yet the argument holds true even when ai is applied to a military autonomous system that is not weaponised: how does one hold a machine accountable for a crime? what about a tort? can an artificial agent understand the moral and ethical content of its instructions? these are thorny questions, and in many cases these questions have been answered in the negative, as artificial entities lack any contingent moral agency. so what if the ai is not alone, but linked with or overseen by a human being, with their own moral and ethical understandings and obligations? who is responsible for any malfeasance that may be committed? does the human bear the legal risks of unethical or immoral decisions by an ai? these are some of the questions this manuscript seeks to engage with.

2022-10-09

Preethi Seshadri, Pouya Pezeshkpour, Sameer Singh
Abstract: recently, there has been an increase in efforts to understand how large language models (llms) propagate and amplify social biases. several works have utilized templates for fairness evaluation, which allow researchers to quantify social biases in the absence of test sets with protected attribute labels. while template evaluation can be a convenient and helpful diagnostic tool to understand model deficiencies, it often uses a simplistic and limited set of templates. in this paper, we study whether bias measurements are sensitive to the choice of templates used for benchmarking. specifically, we investigate the instability of bias measurements by manually modifying templates proposed in previous works in a semantically-preserving manner and measuring bias across these modifications. we find that bias values and resulting conclusions vary considerably across template modifications on four tasks, ranging from an 81% reduction (nli) to a 162% increase (mlm) in (task-specific) bias measurements. our results indicate that quantifying fairness in llms, as done in current practice, can be brittle and needs to be approached with more care and caution.

2022-10-07

Jwala Dhamala, Varun Kumar, Rahul Gupta, Kai-Wei Chang, Aram Galstyan
Abstract: several prior works have shown that language models (lms) can generate text containing harmful social biases and stereotypes. while decoding algorithms play a central role in determining properties of lm generated text, their impact on the fairness of the generations has not been studied. we present a systematic analysis of the impact of decoding algorithms on lm fairness, and analyze the trade-off between fairness, diversity and quality. our experiments with top-$p$, top-$k$ and temperature decoding algorithms, in open-ended language generation, show that fairness across demographic groups changes significantly with change in decoding algorithm's hyper-parameters. notably, decoding algorithms that output more diverse text also output more texts with negative sentiment and regard. we present several findings and provide recommendations on standardized reporting of decoding details in fairness evaluations and optimization of decoding algorithms for fairness alongside quality and diversity.
Kyra Yee, Alice Schoenauer Sebag, Olivia Redfield, Emily Sheng, Matthias Eck, Luca Belli
Abstract: harmful content detection models tend to have higher false positive rates for content from marginalized groups. in the context of marginal abuse modeling on twitter, such disproportionate penalization poses the risk of reduced visibility, where marginalized communities lose the opportunity to voice their opinion on the platform. current approaches to algorithmic harm mitigation, and bias detection for nlp models are often very ad hoc and subject to human bias. we make two main contributions in this paper. first, we design a novel methodology, which provides a principled approach to detecting and measuring the severity of potential harms associated with a text-based model. second, we apply our methodology to audit twitter's english marginal abuse model, which is used for removing amplification eligibility of marginally abusive content. without utilizing demographic labels or dialect classifiers, we are still able to detect and measure the severity of issues related to the over-penalization of the speech of marginalized communities, such as the use of reclaimed speech, counterspeech, and identity related terms. in order to mitigate the associated harms, we experiment with adding additional true negative examples and find that doing so provides improvements to our fairness metrics without large degradations in model performance.

2022-10-06

Vinodkumar Prabhakaran, Margaret Mitchell, Timnit Gebru, Iason Gabriel
Abstract: research on fairness, accountability, transparency and ethics of ai-based interventions in society has gained much-needed momentum in recent years. however it lacks an explicit alignment with a set of normative values and principles that guide this research and interventions. rather, an implicit consensus is often assumed to hold for the values we impart into our models - something that is at odds with the pluralistic world we live in. in this paper, we put forth the doctrine of universal human rights as a set of globally salient and cross-culturally recognized set of values that can serve as a grounding framework for explicit value alignment in responsible ai - and discuss its efficacy as a framework for civil society partnership and participation. we argue that a human rights framework orients the research in this space away from the machines and the risks of their biases, and towards humans and the risks to their rights, essentially helping to center the conversation around who is harmed, what harms they face, and how those harms may be mitigated.
Masahiro Kaneko, Danushka Bollegala, Naoaki Okazaki
Abstract: we study the relationship between task-agnostic intrinsic and task-specific extrinsic social bias evaluation measures for masked language models (mlms), and find that there exists only a weak correlation between these two types of evaluation measures. moreover, we find that mlms debiased using different methods still re-learn social biases during fine-tuning on downstream tasks. we identify the social biases in both training instances as well as their assigned labels as reasons for the discrepancy between intrinsic and extrinsic bias evaluation measurements. overall, our findings highlight the limitations of existing mlm bias evaluation measures and raise concerns on the deployment of mlms in downstream applications using those measures.
Loukas Ilias, Felix Soldner, Bennett Kleinberg
Abstract: people are regularly confronted with potentially deceptive statements (e.g., fake news, misleading product reviews, or lies about activities). only few works on automated text-based deception detection have exploited the potential of deep learning approaches. a critique of deep-learning methods is their lack of interpretability, preventing us from understanding the underlying (linguistic) mechanisms involved in deception. however, recent advancements have made it possible to explain some aspects of such models. this paper proposes and evaluates six deep-learning models, including combinations of bert (and roberta), multihead attention, co-attentions, and transformers. to understand how the models reach their decisions, we then examine the model's predictions with lime. we then zoom in on vocabulary uniqueness and the correlation of liwc categories with the outcome class (truthful vs deceptive). the findings suggest that our transformer-based models can enhance automated deception detection performances (+2.11% in accuracy) and show significant differences pertinent to the usage of liwc features in truthful and deceptive statements.
David Wingate, Mohammad Shoeybi, Taylor Sorensen
Abstract: we explore the idea of compressing the prompts used to condition language models, and show that compressed prompts can retain a substantive amount of information about the original prompt. for severely compressed prompts, while fine-grained information is lost, abstract information and general sentiments can be retained with surprisingly few parameters, which can be useful in the context of decode-time algorithms for controllability and toxicity reduction. we explore contrastive conditioning to steer language model generation towards desirable text and away from undesirable text, and find that some complex prompts can be effectively compressed into a single token to guide generation. we also show that compressed prompts are largely compositional, and can be constructed such that they can be used to control independent aspects of generated text.

2022-10-05

Shalaleh Rismani, Renee Shelby, Andrew Smart, Edgar Jatho, Joshua Kroll, Ajung Moon, Negar Rostamzadeh
Abstract: inappropriate design and deployment of machine learning (ml) systems leads to negative downstream social and ethical impact -- described here as social and ethical risks -- for users, society and the environment. despite the growing need to regulate ml systems, current processes for assessing and mitigating risks are disjointed and inconsistent. we interviewed 30 industry practitioners on their current social and ethical risk management practices, and collected their first reactions on adapting safety engineering frameworks into their practice -- namely, system theoretic process analysis (stpa) and failure mode and effects analysis (fmea). our findings suggest stpa/fmea can provide appropriate structure toward social and ethical risk assessment and mitigation processes. however, we also find nontrivial challenges in integrating such frameworks in the fast-paced culture of the ml industry. we call on the ml research community to strengthen existing frameworks and assess their efficacy, ensuring that ml systems are safer for all people.

2022-10-04

Zhijing Jin, Sydney Levine, Fernando Gonzalez, Ojasv Kamal, Maarten Sap, Mrinmaya Sachan, Rada Mihalcea, Josh Tenenbaum, Bernhard Schölkopf
Abstract: ai systems are becoming increasingly intertwined with human life. in order to effectively collaborate with humans and ensure safety, ai systems need to be able to understand, interpret and predict human moral judgments and decisions. human moral judgments are often guided by rules, but not always. a central challenge for ai safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. in this paper, we present a novel challenge set consisting of rule-breaking question answering (rbqa) of cases that involve potentially permissible rule-breaking -- inspired by recent moral psychology studies. using a state-of-the-art large language model (llm) as a basis, we propose a novel moral chain of thought (moralcot) prompting strategy that combines the strengths of llms with theories of moral reasoning developed in cognitive science to predict human moral judgments. moralcot outperforms seven existing llms by 6.2% f1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. we also conduct a detailed error analysis to suggest directions for future work to improve ai safety using rbqa. our data is open-sourced at https://huggingface.co/datasets/feradauto/moralexceptqa and code at https://github.com/feradauto/moralcot

2022-10-03

Tom Bewley, Jonathan Lawry, Arthur Richards, Rachel Craddock, Ian Henderson
Abstract: recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. we explore the merits of learning intrinsically interpretable tree models instead. we develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
Ke Shen, Mayank Kejriwal
Abstract: recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language understanding (nlu) problems, such as question answering (qa) and abductive reasoning. despite these advances, there is limited work still on understanding whether these models respond to perturbed multiple-choice instances in a sufficiently robust manner that would allow them to be trusted in real-world situations. we present four confusion probes, inspired by similar phenomena first identified in the behavioral science community, to test for problems such as prior bias and choice paralysis. experimentally, we probe a widely used transformer-based multiple-choice nlu system using four established benchmark datasets. here we show that the model exhibits significant prior bias and to a lesser, but still highly significant degree, choice paralysis, in addition to other problems. our results suggest that stronger testing protocols and additional benchmarks may be necessary before the language models are used in front-facing systems or decision making with real world consequences.

2022-10-02

Gavin Abercrombie, Verena Rieser
Abstract: conversational ai systems can engage in unsafe behaviour when handling users' medical queries that can have severe consequences and could even lead to deaths. systems therefore need to be capable of both recognising the seriousness of medical inputs and producing responses with appropriate levels of risk. we create a corpus of human written english language medical queries and the responses of different types of systems. we label these with both crowdsourced and expert annotations. while individual crowdworkers may be unreliable at grading the seriousness of the prompts, their aggregated labels tend to agree with professional opinion to a greater extent on identifying the medical queries and recognising the risk types posed by the responses. results of classification experiments suggest that, while these tasks can be automated, caution should be exercised, as errors can potentially be very serious.

2022-09-28

Francis Rhys Ward, Francesco Belardinelli, Francesca Toni
Abstract: we define a novel neuro-symbolic framework, argumentative reward learning, which combines preference-based argumentation with existing approaches to reinforcement learning from human feedback. our method improves prior work by generalising human preferences, reducing the burden on the user and increasing the robustness of the reward model. we demonstrate this with a number of experiments.
Amelia Glaese, Nat Mcaleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, Geoffrey Irving
Abstract: we present sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. we use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. first, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. we demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. for factual questions, evidence provided by sparrow supports the sampled response 78% of the time. sparrow is preferred more often than baselines while being more resilient to adversarial probing by humans, violating our rules only 8% of the time when probed. finally, we conduct extensive analyses showing that though our model learns to follow our rules it can exhibit distributional biases.

2022-09-27

Abhilash Chakraborty, Anupam Biswas, Ajoy Kumar Khan
Abstract: with the advent of the digital era, every day-to-day task is automated due to technological advances. however, technology has yet to provide people with enough tools and safeguards. as the internet connects more-and-more devices around the globe, the question of securing the connected devices grows at an even spiral rate. data thefts, identity thefts, fraudulent transactions, password compromises, and system breaches are becoming regular everyday news. the surging menace of cyber-attacks got a jolt from the recent advancements in artificial intelligence. ai is being applied in almost every field of different sciences and engineering. the intervention of ai not only automates a particular task but also improves efficiency by many folds. so it is evident that such a scrumptious spread would be very appetizing to cybercriminals. thus the conventional cyber threats and attacks are now ``intelligent" threats. this article discusses cybersecurity and cyber threats along with both conventional and intelligent ways of defense against cyber-attacks. furthermore finally, end the discussion with the potential prospects of the future of ai in cybersecurity.

2022-09-24

Nanyun Peng
Abstract: recent advances in large pre-trained language models have demonstrated strong results in generating natural languages and significantly improved performances for many natural language generation (nlg) applications such as machine translation and text summarization. however, when the generation tasks are more open-ended and the content is under-specified, existing techniques struggle to generate long-term coherent and creative content. moreover, the models exhibit and even amplify social biases that are learned from the training corpora. this happens because the generation models are trained to capture the surface patterns (i.e. sequences of words), instead of capturing underlying semantics and discourse structures, as well as background knowledge including social norms. in this paper, i introduce our recent works on controllable text generation to enhance the creativity and fairness of language generation models. we explore hierarchical generation and constrained decoding, with applications to creative language generation including story, poetry, and figurative languages, and bias mitigation for generation models.

2022-09-23

Valdemar Danry, Pat Pataranutaporn, Ziv Epstein, Matthew Groh, Pattie Maes
Abstract: the ability to discern between true and false information is essential to making sound decisions. however, with the recent increase in ai-based disinformation campaigns, it has become critical to understand the influence of deceptive systems on human information processing. in experiment (n=128), we investigated how susceptible people are to deceptive ai systems by examining how their ability to discern true news from fake news varies when ai systems are perceived as either human fact-checkers or ai fact-checking systems, and when explanations provided by those fact-checkers are either deceptive or honest. we find that deceitful explanations significantly reduce accuracy, indicating that people are just as likely to believe deceptive ai explanations as honest ai explanations. although before getting assistance from an ai-system, people have significantly higher weighted discernment accuracy on false headlines than true headlines, we found that with assistance from an ai system, discernment accuracy increased significantly when given honest explanations on both true headlines and false headlines, and decreased significantly when given deceitful explanations on true headlines and false headlines. further, we did not observe any significant differences in discernment between explanations perceived as coming from a human fact checker compared to an ai-fact checker. similarly, we found no significant differences in trust. these findings exemplify the dangers of deceptive ai systems and the need for finding novel ways to limit their influence human information processing.

2022-09-21

Hannah Rose Kirk, Bertie Vidgen, Scott A. Hale
Abstract: annotating abusive language is expensive, logistically complex and creates a risk of psychological harm. however, most machine learning research has prioritized maximizing effectiveness (i.e., f1 or accuracy score) rather than data efficiency (i.e., minimizing the amount of data that is annotated). in this paper, we use simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach to substantially raise efficiency whilst still maintaining high effectiveness, especially when abusive content is a smaller percentage of the dataset. this approach requires a fraction of labeled data to reach performance equivalent to training over the full dataset.
Benjamin S. Bucknall, Shiri Dori-Hacohen
Abstract: there is a substantial and ever-growing corpus of evidence and literature exploring the impacts of artificial intelligence (ai) technologies on society, politics, and humanity as a whole. a separate, parallel body of work has explored existential risks to humanity, including but not limited to that stemming from unaligned artificial general intelligence (agi). in this paper, we problematise the notion that current and near-term artificial intelligence technologies have the potential to contribute to existential risk by acting as intermediate risk factors, and that this potential is not limited to the unaligned agi scenario. we propose the hypothesis that certain already-documented effects of ai can act as existential risk factors, magnifying the likelihood of previously identified sources of existential risk. moreover, future developments in the coming decade hold the potential to significantly exacerbate these risk factors, even in the absence of artificial general intelligence. our main contribution is a (non-exhaustive) exposition of potential ai risk factors and the causal relationships between them, focusing on how ai can affect power dynamics and information security. this exposition demonstrates that there exist causal pathways from ai systems to existential risks that do not presuppose hypothetical future ai capabilities.

2022-09-14

Martin Strobel, Reza Shokri
Abstract: the privacy risks of machine learning models is a major concern when training them on sensitive and personal data. we discuss the tradeoffs between data privacy and the remaining goals of trustworthy machine learning (notably, fairness, robustness, and explainability).

2022-09-08

Neeraja Kirtane, V Manushree, Aditya Kane
Abstract: the gender bias present in the data on which language models are pre-trained gets reflected in the systems that use these models. the model's intrinsic gender bias shows an outdated and unequal view of women in our culture and encourages discrimination. therefore, in order to establish more equitable systems and increase fairness, it is crucial to identify and mitigate the bias existing in these models. while there is a significant amount of work in this area in english, there is a dearth of research being done in other gendered and low resources languages, particularly the indian languages. english is a non-gendered language, where it has genderless nouns. the methodologies for bias detection in english cannot be directly deployed in other gendered languages, where the syntax and semantics vary. in our paper, we measure gender bias associated with occupations in hindi language models. our major contributions in this paper are the construction of a novel corpus to evaluate occupational gender bias in hindi, quantify this existing bias in these systems using a well-defined metric, and mitigate it by efficiently fine-tuning our model. our results reflect that the bias is reduced post-introduction of our proposed mitigation techniques. our codebase is available publicly.
Daniel Schlör, Andreas Hotho
Abstract: the early development and deployment of hospital and healthcare information systems have encouraged the ongoing digitization of processes in hospitals. many of these processes, which previously required paperwork and telephone arrangements, are now integrated into it solutions and require physicians and medical staff to interact with appropriate interfaces and tools. although this shift to digital data management and process support has benefited patient care in many ways, it requires physicians to accurately capture all relevant information digitally for billing and documentation purposes, which takes a lot of time away from actual patient care work. however, systematic collection of healthcare data over a long period of time offers opportunities to improve this process and support medical staff by introducing recommender systems. based on a practical working example, in this position paper, we will outline the design of a responsible recommender system in the medical context from a technical, application driven perspective and discuss potential design choices and criteria with a specific focus on accountability, safety, and fairness.

2022-09-07

Wai Man Si, Michael Backes, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, Savvas Zannettou, Yang Zhang
Abstract: chatbots are used in many applications, e.g., automated agents, smart home assistants, interactive characters in online games, etc. therefore, it is crucial to ensure they do not behave in undesired manners, providing offensive or toxic responses to users. this is not a trivial task as state-of-the-art chatbot models are trained on large, public datasets openly collected from the internet. this paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots. we show that publicly available chatbots are prone to providing toxic responses when fed toxic queries. even more worryingly, some non-toxic queries can trigger toxic responses too. we then set out to design and experiment with an attack, toxicbuddy, which relies on fine-tuning gpt-2 to generate non-toxic queries that make chatbots respond in a toxic manner. our extensive experimental evaluation demonstrates that our attack is effective against public chatbot models and outperforms manually-crafted malicious queries proposed by previous work. we also evaluate three defense mechanisms against toxicbuddy, showing that they either reduce the attack performance at the cost of affecting the chatbot's utility or are only effective at mitigating a portion of the attack. this highlights the need for more research from the computer security and online safety communities to ensure that chatbot models do not hurt their users. overall, we are confident that toxicbuddy can be used as an auditing tool and that our work will pave the way toward designing more effective defenses for chatbot safety.
Yi Cai, Arthur Zimek, Gerhard Wunder, Eirini Ntoutsi
Abstract: hate speech detection is a common downstream application of natural language processing (nlp) in the real world. in spite of the increasing accuracy, current data-driven approaches could easily learn biases from the imbalanced data distributions originating from humans. the deployment of biased models could further enhance the existing social biases. but unlike handling tabular data, defining and mitigating biases in text classifiers, which deal with unstructured data, are more challenging. a popular solution for improving machine learning fairness in nlp is to conduct the debiasing process with a list of potentially discriminated words given by human annotators. in addition to suffering from the risks of overlooking the biased terms, exhaustively identifying bias with human annotators are unsustainable since discrimination is variable among different datasets and may evolve over time. to this end, we propose an automatic misuse detector (mid) relying on an explanation method for detecting potential bias. and built upon that, an end-to-end debiasing framework with the proposed staged correction is designed for text classifiers without any external resources required.

2022-09-05

Yundi Shi, Piji Li, Changchun Yin, Zhaoyang Han, Lu Zhou, Zhe Liu
Abstract: as the pre-trained language models (plms) continue to grow, so do the hardware and data requirements for fine-tuning plms. therefore, the researchers have come up with a lighter method called \textit{prompt learning}. however, during the investigations, we observe that the prompt learning methods are vulnerable and can easily be attacked by some illegally constructed prompts, resulting in classification errors, and serious security problems for plms. most of the current research ignores the security issue of prompt-based methods. therefore, in this paper, we propose a malicious prompt template construction method (\textbf{promptattack}) to probe the security performance of plms. several unfriendly template construction approaches are investigated to guide the model to misclassify the task. extensive experiments on three datasets and three plms prove the effectiveness of our proposed approach promptattack. we also conduct experiments to verify that our method is applicable in few-shot scenarios.
Hezekiah J. Branch, Jonathan Rodriguez Cefalu, Jeremy Mchugh, Leyla Hujer, Aditya Bahl, Daniel Del Castillo Iglesias, Ron Heichman, Ramesh Darwishi
Abstract: recent advances in the development of large language models have resulted in public access to state-of-the-art pre-trained language models (plms), including generative pre-trained transformer 3 (gpt-3) and bidirectional encoder representations from transformers (bert). however, evaluations of plms, in practice, have shown their susceptibility to adversarial attacks during the training and fine-tuning stages of development. such attacks can result in erroneous outputs, model-generated hate speech, and the exposure of users' sensitive information. while existing research has focused on adversarial attacks during either the training or the fine-tuning of plms, there is a deficit of information on attacks made between these two development phases. in this work, we highlight a major security vulnerability in the public release of gpt-3 and further investigate this vulnerability in other state-of-the-art plms. we restrict our work to pre-trained models that have not undergone fine-tuning. further, we underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures. following this approach, we observe a significant decrease in text classification quality when evaluating for semantic similarity.

2022-09-01

Marina Escobar-Planas, Emilia Gómez, Carlos-D Martínez-Hinarejos
Abstract: conversational agents (cas) embodied in speakers or chatbots are becoming very popular in some countries, and despite their adult-centred design, they have become part of children's lives, generating a need for children-centric trustworthy systems. this paper presents a literature review to identify the main opportunities, challenges and risks brought by cas when used by children. we then consider relevant ethical guidelines for ai and adapt them to this particular system and population, using a delphi methodology with a set of experts from different disciplines. from this analysis, we propose specific guidelines to help cas developers improve their design towards trustworthiness and children.

2022-08-31

Zhibo Zhang, Hussam Al Hamadi, Ernesto Damiani, Chan Yeob Yeun, Fatma Taher
Abstract: this survey presents a comprehensive review of current literature on explainable artificial intelligence (xai) methods for cyber security applications. due to the rapid development of internet-connected systems and artificial intelligence in recent years, artificial intelligence including machine learning (ml) and deep learning (dl) has been widely utilized in the fields of cyber security including intrusion detection, malware detection, and spam filtering. however, although artificial intelligence-based approaches for the detection and defense of cyber attacks and threats are more advanced and efficient compared to the conventional signature-based and rule-based cyber security strategies, most ml-based techniques and dl-based techniques are deployed in the black-box manner, meaning that security experts and customers are unable to explain how such procedures reach particular conclusions. the deficiencies of transparency and interpretability of existing artificial intelligence techniques would decrease human users' confidence in the models utilized for the defense against cyber attacks, especially in current situations where cyber attacks become increasingly diverse and complicated. therefore, it is essential to apply xai in the establishment of cyber security models to create more explainable models while maintaining high accuracy and allowing human users to comprehend, trust, and manage the next generation of cyber defense mechanisms. although there are papers reviewing artificial intelligence applications in cyber security areas and the vast literature on applying xai in many fields including healthcare, financial services, and criminal justice, the surprising fact is that there are currently no survey research articles that concentrate on xai applications in cyber security.

2022-08-30

Hua Lu, Siqi Bao, Huang He, Fan Wang, Hua Wu, Haifeng Wang
Abstract: many open-domain dialogue models pre-trained with social media comments can generate coherent replies but have difficulties producing engaging responses when interacting with real users. this phenomenon might mainly result from the deficiency of annotated human-human conversations and the misalignment with human preference. in this paper, we propose a novel and efficient approach diamante to boost the open-domain chatbot, where two kinds of human feedback (including explicit demonstration and implicit preference) are collected and leveraged. by asking annotators to select or amend the model-generated candidate responses, diamante efficiently collects the human demonstrated responses and constructs a chinese chit-chat dataset. to enhance the alignment with human preference, diamante leverages the implicit preference in the data collection process and introduces the generation-evaluation joint training. comprehensive experiments indicate that the diamante dataset and joint training paradigm can significantly boost the performance of chinese pre-trained dialogue models.
Bettina Könighofer, Roderick Bloem, Rüdiger Ehlers, Christian Pek
Abstract: runtime enforcement refers to the theories, techniques, and tools for enforcing correct behavior with respect to a formal specification of systems at runtime. in this paper, we are interested in techniques for constructing runtime enforcers for the concrete application domain of enforcing safety in ai. we discuss how safety is traditionally handled in the field of ai and how more formal guarantees on the safety of a self-learning agent can be given by integrating a runtime enforcer. we survey a selection of work on such enforcers, where we distinguish between approaches for discrete and continuous action spaces. the purpose of this paper is to foster a better understanding of advantages and limitations of different enforcement techniques, focusing on the specific challenges that arise due to their application in ai. finally, we present some open challenges and avenues for future work.
Francesco Sovrano, Giulio Masetti
Abstract: the ai act has been recently proposed by the european commission to regulate the use of ai in the eu, especially on high-risk applications, i.e. systems intended to be used as safety components in the management and operation of road traffic and the supply of water, gas, heating and electricity. on the other hand, iec 61508, one of the most adopted international standards for safety-critical electronic components, seem to mostly forbid the use of ai in such systems. given this conflict between iec 61508 and the proposed ai act, also stressed by the fact that iec 61508 is not an harmonised european standard, with the present paper we study and analyse what is going to happen to industry after the entry into force of the ai act. in particular, we focus on how the proposed ai act might positively impact on the sustainability of critical infrastructures by allowing the use of ai on an industry where it was previously forbidden. to do so, we provide several examples of ai-based solutions falling under the umbrella of iec 61508 that might have a positive impact on sustainability in alignment with the current long-term goals of the eu and the sustainable development goals of the united nations, i.e., affordable and clean energy, sustainable cities and communities.
Stephen Mcaleese
Abstract: superhuman artificial general intelligence could be created this century and would likely be a significant source of existential risk. delaying the creation of superintelligent ai (asi) could decrease total existential risk by increasing the amount of time humanity has to work on the ai alignment problem. however, since asi could reduce most risks, delaying the creation of asi could also increase other existential risks, especially from advanced future technologies such as synthetic biology and molecular nanotechnology. if ai existential risk is high relative to the sum of other existential risk, delaying the creation of asi will tend to decrease total existential risk and vice-versa. other factors such as war and a hardware overhang could increase ai risk and cognitive enhancement could decrease ai risk. to reduce total existential risk, humanity should take robustly positive actions such as working on existential risk analysis, ai governance and safety, and reducing all sources of existential risk by promoting differential technological development.

2022-08-28

Kristine Gloria, Nidhi Rastogi, Stevie Degroff
Abstract: today's large-scale algorithmic and automated deployment of decision-making systems threatens to exclude marginalized communities. thus, the emergent danger comes from the effectiveness and the propensity of such systems to replicate, reinforce, or amplify harmful existing discriminatory acts. algorithmic bias exposes a deeply entrenched encoding of a range of unwanted biases that can have profound real-world effects that manifest in domains from employment, to housing, to healthcare. the last decade of research and examples on these effects further underscores the need to examine any claim of a value-neutral technology. this work examines the intersection of algorithmic bias in consumer mobile health technologies (mhealth). we include mhealth, a term used to describe mobile technology and associated sensors to provide healthcare solutions through patient journeys. we also include mental and behavioral health (mental and physiological) as part of our study. furthermore, we explore to what extent current mechanisms - legal, technical, and or normative - help mitigate potential risks associated with unwanted bias in intelligent systems that make up the mhealth domain. we provide additional guidance on the role and responsibilities technologists and policymakers have to ensure that such systems empower patients equitably.

2022-08-23

Joseph Aylett-Bullock, Miguel Luengo-Oroz
Abstract: ai is being increasingly used to aid response efforts to humanitarian emergencies at multiple levels of decision-making. such ai systems are generally understood to be stand-alone tools for decision support, with ethical assessments, guidelines and frameworks applied to them through this lens. however, as the prevalence of ai increases in this domain, such systems will begin to encounter each other through information flow networks created by interacting decision-making entities, leading to multi-ai complex systems which are often ill understood. in this paper we describe how these multi-ai systems can arise, even in relatively simple real-world humanitarian response scenarios, and lead to potentially emergent and erratic erroneous behavior. we discuss how we can better work towards more trustworthy multi-ai systems by exploring some of the associated challenges and opportunities, and how we can design better mechanisms to understand and assess such systems. this paper is designed to be a first exposition on this topic in the field of humanitarian response, raising awareness, exploring the possible landscape of this domain, and providing a starting point for future work within the wider community.

2022-08-22

Emily Mcmilin
Abstract: in this paper we motivate the causal mechanisms behind sample selection induced collider bias (selection collider bias) that can cause large language models (llms) to learn unconditional dependence between entities that are unconditionally independent in the real world. we show that selection collider bias can become amplified in underspecified learning tasks, and although difficult to overcome, we describe a method to exploit the resulting spurious correlations for determination of when a model may be uncertain about its prediction. we demonstrate an uncertainty metric that matches human uncertainty in tasks with gender pronoun underspecification on an extended version of the winogender schemas evaluation set, and we provide an online demo where users can apply our uncertainty metric to their own texts and models.

2022-08-21

Jan Jezabek, Akash Singh
Abstract: protecting nlp models against misspellings whether accidental or adversarial has been the object of research interest for the past few years. existing remediations have typically either compromised accuracy or required full model re-training with each new class of attacks. we propose a novel method of retroactively adding resilience to misspellings to transformer-based nlp models. this robustness can be achieved without the need for re-training of the original nlp model and with only a minimal loss of language understanding performance on inputs without misspellings. additionally we propose a new efficient approximate method of generating adversarial misspellings, which significantly reduces the cost needed to evaluate a model's resilience to adversarial attacks.

2022-08-17

Rasmus Adler, Michael Klaes
Abstract: the european machinery directive and related harmonized standards do consider that software is used to generate safety-relevant behavior of the machinery but do not consider all kinds of software. in particular, software based on machine learning (ml) are not considered for the realization of safety-relevant behavior. this limits the introduction of suitable safety concepts for autonomous mobile robots and other autonomous machinery, which commonly depend on ml-based functions. we investigated this issue and the way safety standards define safety measures to be implemented against software faults. functional safety standards use safety integrity levels (sils) to define which safety measures shall be implemented. they provide rules for determining the sil and rules for selecting safety measures depending on the sil. in this paper, we argue that this approach can hardly be adopted with respect to ml and other kinds of artificial intelligence (ai). instead of simple rules for determining an sil and applying related measures against faults, we propose the use of assurance cases to argue that the individually selected and applied measures are sufficient in the given case. to get a first rating regarding the feasibility and usefulness of our proposal, we presented and discussed it in a workshop with experts from industry, german statutory accident insurance companies, work safety and standardization commissions, and representatives from various national, european, and international working groups dealing with safety and ai. in this paper, we summarize the proposal and the workshop discussion. moreover, we check to which extent our proposal is in line with the european ai act proposal and current safety standardization initiatives addressing ai and autonomous systems

2022-08-15

Brendon G. Anderson, Tanmay Gautam, Somayeh Sojoudi
Abstract: in this discussion paper, we survey recent research surrounding robustness of machine learning models. as learning algorithms become increasingly more popular in data-driven control systems, their robustness to data uncertainty must be ensured in order to maintain reliable safety-critical operations. we begin by reviewing common formalisms for such robustness, and then move on to discuss popular and state-of-the-art techniques for training robust machine learning models as well as methods for provably certifying such robustness. from this unification of robust machine learning, we identify and discuss pressing directions for future research in the area.
Chuyen Nguyen, Caleb Morgan, Sudip Mittal
Abstract: as the practicality of artificial intelligence (ai) and machine learning (ml) based techniques grow, there is an ever increasing threat of adversarial attacks. there is a need to red team this ecosystem to identify system vulnerabilities, potential threats, characterize properties that will enhance system robustness, and encourage the creation of effective defenses. a secondary need is to share this ai security threat intelligence between different stakeholders like, model developers, users, and ai/ml security professionals. in this paper, we create and describe a prototype system cti4ai, to overcome the need to methodically identify and share ai/ml specific vulnerabilities and threat intelligence.

2022-08-08

Babak Hemmatian, Lav R. Varshney
Abstract: recent work demonstrates a bias in the gpt-3 model towards generating violent text completions when prompted about muslims, compared with christians and hindus. two pre-registered replication attempts, one exact and one approximate, found only the weakest bias in the more recent instruct series version of gpt-3, fine-tuned to eliminate biased and toxic outputs. few violent completions were observed. additional pre-registered experiments, however, showed that using common names associated with the religions in prompts yields a highly significant increase in violent completions, also revealing a stronger second-order bias against muslims. names of muslim celebrities from non-violent domains resulted in relatively fewer violent completions, suggesting that access to individualized information can steer the model away from using stereotypes. nonetheless, content analysis revealed religion-specific violent themes containing highly offensive ideas regardless of prompt format. our results show the need for additional debiasing of large language models to address higher-order schemas and associations.
Arvind Subramaniam, Aryan Mehra, Sayani Kundu
Abstract: hate speech takes many forms to target communities with derogatory comments, and takes humanity a step back in societal progress. hatexplain is a recently published and first dataset to use annotated spans in the form of rationales, along with speech classification categories and targeted communities to make the classification more humanlike, explainable, accurate and less biased. we tune bert to perform this task in the form of rationales and class prediction, and compare our performance on different metrics spanning across accuracy, explainability and bias. our novelty is threefold. firstly, we experiment with the amalgamated rationale class loss with different importance values. secondly, we experiment extensively with the ground truth attention values for the rationales. with the introduction of conservative and lenient attentions, we compare performance of the model on hatexplain and test our hypothesis. thirdly, in order to improve the unintended bias in our models, we use masking of the target community words and note the improvement in bias and explainability metrics. overall, we are successful in achieving model explanability, bias removal and several incremental improvements on the original bert implementation.

2022-08-05

Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston
Abstract: we present blenderbot 3, a 175b parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. we release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. this technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors (roller et al., 2021; komeili et al., 2022). finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. the goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.
Da Ju, Jing Xu, Y-Lan Boureau, Jason Weston
Abstract: the promise of interaction between intelligent conversational agents and humans is that models can learn from such feedback in order to improve. unfortunately, such exchanges in the wild will not always involve human utterances that are benign or of high quality, and will include a mixture of engaged (helpers) and unengaged or even malicious users (trolls). in this work we study how to perform robust learning in such an environment. we introduce a benchmark evaluation, safetymix, which can evaluate methods that learn safe vs. toxic language in a variety of adversarial settings to test their robustness. we propose and analyze several mitigating learning algorithms that identify trolls either at the example or at the user level. our main finding is that user-based methods, that take into account that troll users will exhibit adversarial behavior across multiple examples, work best in a variety of settings on our benchmark. we then test these methods in a further real-life setting of conversations collected during deployment, with similar results.

2022-08-02

Bran Knowles, "Jason D'Cruz", John T. Richards, Kush R. Varshney
Abstract: it is curious that ai increasingly outperforms human decision makers, yet much of the public distrusts ai to make decisions affecting their lives. in this paper we explore a novel theory that may explain one reason for this. we propose that public distrust of ai is a moral consequence of designing systems that prioritize reduction of costs of false positives over less tangible costs of false negatives. we show that such systems, which we characterize as 'distrustful', are more likely to miscategorize trustworthy individuals, with cascading consequences to both those individuals and the overall human-ai trust relationship. ultimately, we argue that public distrust of ai stems from well-founded concern about the potential of being miscategorized. we propose that restoring public trust in ai will require that systems are designed to embody a stance of 'humble trust', whereby the moral costs of the misplaced distrust associated with false negatives is weighted appropriately during development and use.
Dhanasekar Sundararaman, Vivek Subramanian
Abstract: biases in culture, gender, ethnicity, etc. have existed for decades and have affected many areas of human social interaction. these biases have been shown to impact machine learning (ml) models, and for natural language processing (nlp), this can have severe consequences for downstream tasks. mitigating gender bias in information retrieval (ir) is important to avoid propagating stereotypes. in this work, we employ a dataset consisting of two components: (1) relevance of a document to a query and (2) "gender" of a document, in which pronouns are replaced by male, female, and neutral conjugations. we definitively show that pre-trained models for ir do not perform well in zero-shot retrieval tasks when full fine-tuning of a large pre-trained bert encoder is performed and that lightweight fine-tuning performed with adapter networks improves zero-shot retrieval performance almost by 20% over baseline. we also illustrate that pre-trained models have gender biases that result in retrieved articles tending to be more often male than female. we overcome this by introducing a debiasing technique that penalizes the model when it prefers males over females, resulting in an effective model that retrieves articles in a balanced fashion across genders.

2022-08-01

Bran Knowles, John T. Richards, Frens Kroeger
Abstract: efforts to promote fairness, accountability, and transparency are assumed to be critical in fostering trust in ai (tai), but extant literature is frustratingly vague regarding this 'trust'. the lack of exposition on trust itself suggests that trust is commonly understood, uncomplicated, or even uninteresting. but is it? our analysis of tai publications reveals numerous orientations which differ in terms of who is doing the trusting (agent), in what (object), on the basis of what (basis), in order to what (objective), and why (impact). we develop an ontology that encapsulates these key axes of difference to a) illuminate seeming inconsistencies across the literature and b) more effectively manage a dizzying number of tai considerations. we then reflect this ontology through a corpus of publications exploring fairness, accountability, and transparency to examine the variety of ways that tai is considered within and between these approaches to promoting trust.

2022-07-27

Andrew J Lohn, Krystal Alex Jackson
Abstract: we aim to demonstrate the value of mathematical models for policy debates about technological progress in cybersecurity by considering phishing, vulnerability discovery, and the dynamics between patching and exploitation. we then adjust the inputs to those mathematical models to match some possible advances in their underlying technology. we find that ai's impact on phishing may be overestimated but could lead to more attacks going undetected. advances in vulnerability discovery have the potential to help attackers more than defenders. and automation that writes exploits is more useful to attackers than automation that writes patches, although advances that help deploy patches faster have the potential to be more impactful than either.

2022-07-25

Alex Andrew, Sam Spillard, Joshua Collyer, Neil Dhir
Abstract: in this paper we explore cyber security defence, through the unification of a novel cyber security simulator with models for (causal) decision-making through optimisation. particular attention is paid to a recently published approach: dynamic causal bayesian optimisation (dcbo). we propose that dcbo can act as a blue agent when provided with a view of a simulated network and a causal model of how a red agent spreads within that network. to investigate how dcbo can perform optimal interventions on host nodes, in order to reduce the cost of intrusions caused by the red agent. through this we demonstrate a complete cyber-simulation system, which we use to generate observational data for dcbo and provide numerical quantitative results which lay the foundations for future work in this space.
Heidy Khlaaf, Pamela Mishkin, Joshua Achiam, Gretchen Krueger, Miles Brundage
Abstract: codex, a large language model (llm) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code. although codex provides a plethora of benefits, models that may generate code on such scale have significant limitations, alignment problems, the potential to be misused, and the possibility to increase the rate of progress in technical fields that may themselves have destabilizing impacts or have misuse potential. yet such safety impacts are not yet known or remain to be explored. in this paper, we outline a hazard analysis framework constructed at openai to uncover hazards or safety risks that the deployment of models like codex may impose technically, socially, politically, and economically. the analysis is informed by a novel evaluation framework that determines the capacity of advanced code generation techniques against the complexity and expressivity of specification prompts, and their capability to understand and execute them relative to human ability.

2022-07-23

Andrew Hundt, William Agnew, Vicky Zeng, Severin Kacianka, Matthew Gombolay
Abstract: stereotypes, bias, and discrimination have been extensively documented in machine learning (ml) methods such as computer vision (cv) [18, 80], natural language processing (nlp) [6], or both, in the case of large image and caption models such as openai clip [14]. in this paper, we evaluate how ml bias manifests in robots that physically and autonomously act within the world. we audit one of several recently published clip-powered robotic manipulation methods, presenting it with objects that have pictures of human faces on the surface which vary across race and gender, alongside task descriptions that contain terms associated with common stereotypes. our experiments definitively show robots acting out toxic stereotypes with respect to gender, race, and scientifically-discredited physiognomy, at scale. furthermore, the audited methods are less likely to recognize women and people of color. our interdisciplinary sociotechnical analysis synthesizes across fields and applications such as science technology and society (sts), critical studies, history, safety, robotics, and ai. we find that robots powered by large datasets and dissolution models (sometimes called "foundation models", e.g. clip) that contain humans risk physically amplifying malignant stereotypes in general; and that merely correcting disparities will be insufficient for the complexity and scale of the problem. instead, we recommend that robot learning methods that physically manifest stereotypes or other harmful outcomes be paused, reworked, or even wound down when appropriate, until outcomes can be proven safe, effective, and just. finally, we discuss comprehensive policy changes and the potential of new interdisciplinary research on topics like identity safety assessment frameworks and design justice to better understand and address these harms.

2022-07-20

Arash Bateni, Matthew C. Chan, Ray Eitel-Porter
Abstract: this paper summarizes and evaluates various approaches, methods, and techniques for pursuing fairness in artificial intelligence (ai) systems. it examines the merits and shortcomings of these measures and proposes practical guidelines for defining, measuring, and preventing bias in ai. in particular, it cautions against some of the simplistic, yet common, methods for evaluating bias in ai systems, and offers more sophisticated and effective alternatives. the paper also addresses widespread controversies and confusions in the field by providing a common language among different stakeholders of high-impact ai systems. it describes various trade-offs involving ai fairness, and provides practical recommendations for balancing them. it offers techniques for evaluating the costs and benefits of fairness targets, and defines the role of human judgment in setting these targets. this paper provides discussions and guidelines for ai practitioners, organization leaders, and policymakers, as well as various links to additional materials for a more technical audience. numerous real-world examples are provided to clarify the concepts, challenges, and recommendations from a practical perspective.
Oskar Van Der Wal, Jaap Jumelet, Katrin Schulz, Willem Zuidema
Abstract: detecting and mitigating harmful biases in modern language models are widely recognized as crucial, open problems. in this paper, we take a step back and investigate how language models come to be biased in the first place. we use a relatively small language model, using the lstm architecture trained on an english wikipedia corpus. with full access to the data and to the model parameters as they change during every step while training, we can map in detail how the representation of gender develops, what patterns in the dataset drive this, and how the model's internal state relates to the bias in a downstream task (semantic textual similarity). we find that the representation of gender is dynamic and identify different phases during training. furthermore, we show that gender information is represented increasingly locally in the input embeddings of the model and that, as a consequence, debiasing these can be effective in reducing the downstream bias. monitoring the training dynamics, allows us to detect an asymmetry in how the female and male gender are represented in the input embeddings. this is important, as it may cause naive mitigation strategies to introduce new undesirable biases. we discuss the relevance of the findings for mitigation strategies more generally and the prospects of generalizing our methods to larger language models, the transformer architecture, other languages and other undesirable biases.

2022-07-18

Emily Mcmilin
Abstract: in this work we show how large language models (llms) can learn statistical dependencies between otherwise unconditionally independent variables due to dataset selection bias. to demonstrate the effect, we developed a masked gender task that can be applied to bert-family models to reveal spurious correlations between predicted gender pronouns and a variety of seemingly gender-neutral variables like date and location, on pre-trained (unmodified) bert and roberta large models. finally, we provide an online demo, inviting readers to experiment further.

2022-07-17

Christopher D. Wallbridge, Qiyuan Zhang
Abstract: this extended abstract introduces the initial steps taken to develop a system for rapid internal simulation of knowledge (risk). risk aims to enable more transparency in artificial intelligence systems, especially those created by deep learning networks by allowing real-time simulation of what the system knows. by looking at hypothetical situations based on these simulations a system may make more informed decisions, and produce them for non-expert observers to understand the reasoning behind a given action.

2022-07-16

Hans Dermot Doran
Abstract: in this relatively informal discussion-paper we summarise issues in the domains of safety and security in machine learning that will affect industry sectors in the next five to ten years. various products using neural network classification, most often in vision related applications but also in predictive maintenance, have been researched and applied in real-world applications in recent years. nevertheless, reports of underlying problems in both safety and security related domains, for instance adversarial attacks have unsettled early adopters and are threatening to hinder wider scale adoption of this technology. the problem for real-world applicability lies in being able to assess the risk of applying these technologies. in this discussion-paper we describe the process of arriving at a machine-learnt neural network classifier pointing out safety and security vulnerabilities in that workflow, citing relevant research where appropriate.

2022-07-14

Rita Sevastjanova, Mennatallah El-Assady
Abstract: language models learn and represent language differently than humans; they learn the form and not the meaning. thus, to assess the success of language model explainability, we need to consider the impact of its divergence from a user's mental model of language. in this position paper, we argue that in order to avoid harmful rationalization and achieve truthful understanding of language models, explanation processes must satisfy three main conditions: (1) explanations have to truthfully represent the model behavior, i.e., have a high fidelity; (2) explanations must be complete, as missing information distorts the truth; and (3) explanations have to take the user's mental model into account, progressively verifying a person's knowledge and adapting their understanding. we introduce a decision tree model to showcase potential reasons why current explanations fail to reach their objectives. we further emphasize the need for human-centered design to explain the model from multiple perspectives, progressively adapting explanations to changing user expectations.

2022-07-10

Pieter Delobelle, Bettina Berendt
Abstract: large pre-trained language models are successfully being used in a variety of tasks, across many languages. with this ever-increasing usage, the risk of harmful side effects also rises, for example by reproducing and reinforcing stereotypes. however, detecting and mitigating these harms is difficult to do in general and becomes computationally expensive when tackling multiple languages or when considering different biases. to address this, we present fairdistillation: a cross-lingual method based on knowledge distillation to construct smaller language models while controlling for specific biases. we found that our distillation method does not negatively affect the downstream performance on most tasks and successfully mitigates stereotyping and representational harms. we demonstrate that fairdistillation can create fairer language models at a considerably lower cost than alternative approaches.

2022-07-08

Shaina Raza, Deepak John Reji, Dora D. Liu, Syed Raza Bashir, Usman Naseem
Abstract: recommender systems, information retrieval, and other information access systems present unique challenges for examining and applying concepts of fairness and bias mitigation in unstructured text. this paper introduces dbias, which is a python package to ensure fairness in news articles. dbias is a trained machine learning (ml) pipeline that can take a text (e.g., a paragraph or news story) and detects if the text is biased or not. then, it detects the biased words in the text, masks them, and recommends a set of sentences with new words that are bias-free or at least less biased. we incorporate the elements of data science best practices to ensure that this pipeline is reproducible and usable. we show in experiments that this pipeline can be effective for mitigating biases and outperforms the common neural network architectures in ensuring fairness in the news articles.

2022-07-06

Przemyslaw Joniak, Akiko Aizawa
Abstract: language model debiasing has emerged as an important field of study in the nlp community. numerous debiasing techniques were proposed, but bias ablation remains an unaddressed issue. we demonstrate a novel framework for inspecting bias in pre-trained transformer-based language models via movement pruning. given a model and a debiasing objective, our framework finds a subset of the model containing less bias than the original model. we implement our framework by pruning the model while fine-tuning it on the debiasing objective. optimized are only the pruning scores - parameters coupled with the model's weights that act as gates. we experiment with pruning attention heads, an important building block of transformers: we prune square blocks, as well as establish a new way of pruning the entire heads. lastly, we demonstrate the usage of our framework using gender bias, and based on our findings, we propose an improvement to an existing debiasing method. additionally, we re-discover a bias-performance trade-off: the better the model performs, the more bias it contains.

2022-07-03

Yi Zhang, Junyang Wang, Jitao Sang
Abstract: vision-language pre-training (vlp) models have achieved state-of-the-art performance in numerous cross-modal tasks. since they are optimized to capture the statistical properties of intra- and inter-modality, there remains risk to learn social biases presented in the data as well. in this work, we (1) introduce a counterfactual-based bias measurement \emph{counterbias} to quantify the social bias in vlp models by comparing the [mask]ed prediction probabilities of factual and counterfactual samples; (2) construct a novel vl-bias dataset including 24k image-text pairs for measuring gender bias in vlp models, from which we observed that significant gender bias is prevalent in vlp models; and (3) propose a vlp debiasing method \emph{fairvlp} to minimize the difference in the [mask]ed prediction probabilities between factual and counterfactual image-text pairs for vlp debiasing. although counterbias and fairvlp focus on social bias, they are generalizable to serve as tools and provide new insights to probe and regularize more knowledge in vlp models.

2022-07-02

Travis Lacroix
Abstract: the value-alignment problem for artificial intelligence (ai) asks how we can ensure that the 'values' (i.e., objective functions) of artificial systems are aligned with the values of humanity. in this paper, i argue that linguistic communication (natural language) is a necessary condition for robust value alignment. i discuss the consequences that the truth of this claim would have for research programmes that attempt to ensure value alignment for ai systems; or, more loftily, designing robustly beneficial or ethical artificial agents.

2022-07-01

Kaspar Rosager Ludvigsen, Shishir Nagaraja, Angela Daly
Abstract: computational law has begun taking the role in society which has been predicted for some time. automated decision-making and systems which assist users are now used in various jurisdictions, but with this maturity come certain caveats. computational law exists on the platforms which enable it, in this case digital systems, which means that it inherits the same flaws. cybersecurity addresses these potential weaknesses. in this paper we go through known issues and discuss them in the various levels, from design to the physical realm. we also look at machine-learning specific adversarial problems. additionally, we make certain considerations regarding computational law and existing and future legislation. finally, we present three recommendations which are necessary for computational law to function globally, and which follow ideas in safety and security engineering. as indicated, we find that computational law must seriously consider that not only does it face the same risks as other types of software and computer systems, but that failures within it may cause financial or physical damage, as well as injustice. consequences of computational legal systems failing are greater than if they were merely software and hardware. if the system employs machine-learning, it must take note of the very specific dangers which this brings, of which data poisoning is the classic example. computational law must also be explicitly legislated for, which we show is not the case currently in the eu, and this is also true for the cybersecurity aspects that will be relevant to it. but there is great hope in eu's proposed ai act, which makes an important attempt at taking the specific problems which computational law bring into the legal sphere. our recommendations for computational law and cybersecurity are: accommodation of threats, adequate use, and that humans remain in the centre of their deployment.
Robert Wolfe, Aylin Caliskan
Abstract: three state-of-the-art language-and-image ai models, clip, slip, and blip, are evaluated for evidence of a bias previously observed in social and experimental psychology: equating american identity with being white. embedding association tests (eats) using standardized images of self-identified asian, black, latina/o, and white individuals from the chicago face database (cfd) reveal that white individuals are more associated with collective in-group words than are asian, black, or latina/o individuals. in assessments of three core aspects of american identity reported by social psychologists, single-category eats reveal that images of white individuals are more associated with patriotism and with being born in america, but that, consistent with prior findings in psychology, white individuals are associated with being less likely to treat people of all races and backgrounds equally. three downstream machine learning tasks demonstrate biases associating american with white. in a visual question answering task using blip, 97% of white individuals are identified as american, compared to only 3% of asian individuals. when asked in what state the individual depicted lives in, the model responds china 53% of the time for asian individuals, but always with an american state for white individuals. in an image captioning task, blip remarks upon the race of asian individuals as much as 36% of the time, but never remarks upon race for white individuals. finally, provided with an initialization image from the cfd and the text "an american person," a synthetic image generator (vqgan) using the text-based guidance of clip lightens the skin tone of individuals of all races (by 35% for black individuals, based on pixel brightness). the results indicate that biases equating american identity with being white are learned by language-and-image ai, and propagate to downstream applications of such models.

2022-06-30

Lionel Nganyewou Tidjon, Foutse Khomh
Abstract: machine learning is a field of artificial intelligence (ai) that is becoming essential for several critical systems, making it a good target for threat actors. threat actors exploit different tactics, techniques, and procedures (ttps) against the confidentiality, integrity, and availability of machine learning (ml) systems. during the ml cycle, they exploit adversarial ttps to poison data and fool ml-based systems. in recent years, multiple security practices have been proposed for traditional systems but they are not enough to cope with the nature of ml-based systems. in this paper, we conduct an empirical study of threats reported against ml-based systems with the aim to understand and characterize the nature of ml threats and identify common mitigation strategies. the study is based on 89 real-world ml attack scenarios from the mitre's atlas database, the ai incident database, and the literature; 854 ml repositories from the github search and the python packaging advisory database, selected based on their reputation. attacks from the ai incident database and the literature are used to identify vulnerabilities and new types of threats that were not documented in atlas. results show that convolutional neural networks were one of the most targeted models among the attack scenarios. ml repositories with the largest vulnerability prominence include tensorflow, opencv, and notebook. in this paper, we also report the most frequent vulnerabilities in the studied ml repositories, the most targeted ml phases and models, the most used ttps in ml phases and attack scenarios. this information is particularly important for red/blue teams to better conduct attacks/defenses, for practitioners to prevent threats during ml development, and for researchers to develop efficient defense mechanisms.
Amin Rasekh, Ian Eisenberg
Abstract: natural language generation models are computer systems that generate coherent language when prompted with a sequence of words as context. despite their ubiquity and many beneficial applications, language generation models also have the potential to inflict social harms by generating discriminatory language, hateful speech, profane content, and other harmful material. ethical assessment of these models is therefore critical. but it is also a challenging task, requiring an expertise in several specialized domains, such as computational linguistics and social justice. while significant strides have been made by the research community in this domain, accessibility of such ethical assessments to the wider population is limited due to the high entry barriers. this article introduces a new tool to democratize and standardize ethical assessment of natural language generation models: tool for ethical assessment of language generation models (teal), a component of credo ai lens, an open-source assessment framework.

2022-06-29

Venelin Kovatchev, Trina Chatterjee, Venkata S Govindarajan, Jifan Chen, Eunsol Choi, Gabriella Chronis, Anubrata Das, Katrin Erk, Matthew Lease, Junyi Jessy Li, Yating Wu, Kyle Mahowald
Abstract: developing methods to adversarially challenge nlp systems is a promising avenue for improving both model performance and interpretability. here, we describe the approach of the team "longhorns" on task 1 of the the first workshop on dynamic adversarial data collection (dadc), which asked teams to manually fool a model on an extractive question answering task. our team finished first, with a model error rate of 62%. we advocate for a systematic, linguistically informed approach to formulating adversarial questions, and we describe the results of our pilot experiments, as well as our official submission.

2022-06-27

Alexander Matt Turner, Prasad Tadepalli
Abstract: if capable ai agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. in fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive. however, the real world is neither fully observable, nor must trained agents be even approximately reward-optimal. we consider a range of models of ai decision-making, from optimal, to random, to choices informed by learning and interacting with an environment. we discover that many decision-making functions are retargetable, and that retargetability is sufficient to cause power-seeking tendencies. our functional criterion is simple and broad. we show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power. we demonstrate the flexibility of our results by reasoning about learned policy incentives in montezuma's revenge. these results suggest a safety risk: eventually, retargetable training procedures may train real-world agents which seek power over humans.

2022-06-25

John Nay, James Daily
Abstract: given that artificial intelligence (ai) increasingly permeates our lives, it is critical that we systematically align ai objectives with the goals and values of humans. the human-ai alignment problem stems from the impracticality of explicitly specifying the rewards that ai models should receive for all the actions they could take in all relevant states of the world. one possible solution, then, is to leverage the capabilities of ai models to learn those rewards implicitly from a rich source of data describing human values in a wide range of contexts. the democratic policy-making process produces just such data by developing specific rules, flexible standards, interpretable guidelines, and generalizable precedents that synthesize citizens' preferences over potential actions taken in many states of the world. therefore, computationally encoding public policies to make them legible to ai systems should be an important part of a socio-technical approach to the broader human-ai alignment puzzle. this essay outlines research on ai that learn structures in policy data that can be leveraged for downstream tasks. as a demonstration of the ability of ai to comprehend policy, we provide a case study of an ai system that predicts the relevance of proposed legislation to any given publicly traded company and its likely effect on that company. we believe this represents the "comprehension" phase of ai and policy, but leveraging policy as a key source of human values to align ai requires "understanding" policy. solving the alignment problem is crucial to ensuring that ai is beneficial both individually (to the person or group deploying the ai) and socially. as ai systems are given increasing responsibility in high-stakes contexts, integrating democratically-determined policy into those systems could align their behavior with human goals in a way that is responsive to a constantly evolving society.

2022-06-23

Virginia K. Felkner, Ho-Chun Herbert Chang, Eugene Jang, Jonathan May
Abstract: this paper presents exploratory work on whether and to what extent biases against queer and trans people are encoded in large language models (llms) such as bert. we also propose a method for reducing these biases in downstream tasks: finetuning the models on data written by and/or about queer people. to measure anti-queer bias, we introduce a new benchmark dataset, winoqueer, modeled after other bias-detection benchmarks but addressing homophobic and transphobic biases. we found that bert shows significant homophobic bias, but this bias can be mostly mitigated by finetuning bert on a natural language corpus written by members of the lgbtq+ community.
Yang Trista Cao, Anna Sotnikova, Hal Daumé, Rachel Rudinger, Linda Zou
Abstract: nlp models trained on text have been shown to reproduce human stereotypes, which can magnify harms to marginalized groups when systems are deployed at scale. we adapt the agency-belief-communion (abc) stereotype model of koch et al. (2016) from social psychology as a framework for the systematic study and discovery of stereotypic group-trait associations in language models (lms). we introduce the sensitivity test (set) for measuring stereotypical associations from language models. to evaluate set and other measures using the abc model, we collect group-trait judgments from u.s.-based subjects to compare with english lm stereotypes. finally, we extend this framework to measure lm stereotyping of intersectional identities.
Alexander Matt Turner
Abstract: we do not know how to align a very intelligent ai agent's behavior with human interests. i investigate whether -- absent a full solution to this ai alignment problem -- we can build smart ai agents which have limited impact on the world, and which do not autonomously seek power. in this thesis, i introduce the attainable utility preservation (aup) method. i demonstrate that aup produces conservative, option-preserving behavior within toy gridworlds and within complex environments based off of conway's game of life. i formalize the problem of side effect avoidance, which provides a way to quantify the side effects an agent had on the world. i also give a formal definition of power-seeking in the context of ai agents and show that optimal policies tend to seek power. in particular, most reward functions have optimal policies which avoid deactivation. this is a problem if we want to deactivate or correct an intelligent agent after we have deployed it. my theorems suggest that since most agent goals conflict with ours, the agent would very probably resist correction. i extend these theorems to show that power-seeking incentives occur not just for optimal decision-makers, but under a wide range of decision-making procedures.
A. Feder Cooper, Jonathan Frankle, Christopher De Sa
Abstract: legal literature on machine learning (ml) tends to focus on harms, and thus tends to reason about individual model outcomes and summary error rates. this focus has masked important aspects of ml that are rooted in its reliance on randomness -- namely, stochasticity and non-determinism. while some recent work has begun to reason about the relationship between stochasticity and arbitrariness in legal contexts, the role of non-determinism more broadly remains unexamined. in this paper, we clarify the overlap and differences between these two concepts, and show that the effects of non-determinism, and consequently its implications for the law, become clearer from the perspective of reasoning about ml outputs as distributions over possible outcomes. this distributional viewpoint accounts for randomness by emphasizing the possible outcomes of ml. importantly, this type of reasoning is not exclusive with current legal reasoning; it complements (and in fact can strengthen) analyses concerning individual, concrete outcomes for specific automated decisions. by illuminating the important role of non-determinism, we demonstrate that ml code falls outside of the cyberlaw frame of treating ``code as law,'' as this frame assumes that code is deterministic. we conclude with a brief discussion of what work ml can do to constrain the potentially harm-inducing effects of non-determinism, and we indicate where the law must do work to bridge the gap between its current individual-outcome focus and the distributional approach that we recommend.
Lionel Nganyewou Tidjon, Foutse Khomh
Abstract: artificial intelligence (ai) is becoming the corner stone of many systems used in our daily lives such as autonomous vehicles, healthcare systems, and unmanned aircraft systems. machine learning is a field of ai that enables systems to learn from data and make decisions on new data based on models to achieve a given goal. the stochastic nature of ai models makes verification and validation tasks challenging. moreover, there are intrinsic biaises in ai models such as reproductibility bias, selection bias (e.g., races, genders, color), and reporting bias (i.e., results that do not reflect the reality). increasingly, there is also a particular attention to the ethical, legal, and societal impacts of ai. ai systems are difficult to audit and certify because of their black-box nature. they also appear to be vulnerable to threats; ai systems can misbehave when untrusted data are given, making them insecure and unsafe. governments, national and international organizations have proposed several principles to overcome these challenges but their applications in practice are limited and there are different interpretations in the principles that can bias implementations. in this paper, we examine trust in the context of ai-based systems to understand what it means for an ai system to be trustworthy and identify actions that need to be undertaken to ensure that ai systems are trustworthy. to achieve this goal, we first review existing approaches proposed for ensuring the trustworthiness of ai systems, in order to identify potential conceptual gaps in understanding what trustworthy ai is. then, we suggest a trust (resp. zero-trust) model for ai and suggest a set of properties that should be satisfied to ensure the trustworthiness of ai systems.
Akhter Al Amin, Kazi Sinthia Kabir
Abstract: language models (lm) are becoming prevalent in many language-based application spaces globally. although these lms are improving our day-to-day interactions with digital products, concerns remain whether open-ended languages or text generated from these models reveal any biases toward a specific group of people, thereby risking the usability of a certain product. there is a need to identify whether these models possess bias to improve the fairness in these models. this gap motivates our ongoing work, where we measured the two aspects of bias in gpt-3 generated text through a disability lens.

2022-06-21

Tomasz Limisiewicz, David Mareček
Abstract: the representations in large language models contain multiple types of gender information. we focus on two types of such signals in english texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. we can disentangle the model's embeddings and identify components encoding both types of information with probing. we aim to diminish the stereotypical bias in the representations while preserving the factual gender signal. our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without significant deterioration of language modeling capabilities. the findings can be applied to language generation to mitigate reliance on stereotypes while preserving gender agreement in coreferences.
Tara Roberson, Stephen Bornstein, Rain Liivoja, Simon Ng, Jason Scholz, S. Kate Devitt
Abstract: what does it mean to be responsible and responsive when developing and deploying trusted autonomous systems in defence? in this short reflective article, we describe a case study of building a trusted autonomous system - athena ai - within an industry-led, government-funded project with diverse collaborators and stakeholders. using this case study, we draw out lessons on the value and impact of embedding responsible research and innovation-aligned, ethics-by-design approaches and principles throughout the development of technology at high translation readiness levels.

2022-06-20

Giovanni Apruzzese, Pavel Laskov, Edgardo Montes De Oca, Wissam Mallouli, Luis Burdalo Rapa, Athanasios Vasileios Grammatopoulos, Fabio Di Franco
Abstract: machine learning (ml) represents a pivotal technology for current and future information systems, and many domains already leverage the capabilities of ml. however, deployment of ml in cybersecurity is still at an early stage, revealing a significant discrepancy between research and practice. such discrepancy has its root cause in the current state-of-the-art, which does not allow to identify the role of ml in cybersecurity. the full potential of ml will never be unleashed unless its pros and cons are understood by a broad audience. this paper is the first attempt to provide a holistic understanding of the role of ml in the entire cybersecurity domain -- to any potential reader with an interest in this topic. we highlight the advantages of ml with respect to human-driven detection methods, as well as the additional tasks that can be addressed by ml in cybersecurity. moreover, we elucidate various intrinsic problems affecting real ml deployments in cybersecurity. finally, we present how various stakeholders can contribute to future developments of ml in cybersecurity, which is essential for further progress in this field. our contributions are complemented with two real case studies describing industrial applications of ml as defense against cyber-threats.
Yarden Tal, Inbal Magar, Roy Schwartz
Abstract: the size of pretrained models is increasing, and so is their performance on a variety of nlp tasks. however, as their memorization capacity grows, they might pick up more social biases. in this work, we examine the connection between model size and its gender bias (specifically, occupational gender bias). we measure bias in three masked language model families (roberta, deberta, and t5) in two setups: directly using prompt based method, and using a downstream task (winogender). we find on the one hand that larger models receive higher bias scores on the former task, but when evaluated on the latter, they make fewer gender errors. to examine these potentially conflicting results, we carefully investigate the behavior of the different models on winogender. we find that while larger models outperform smaller ones, the probability that their mistakes are caused by gender bias is higher. moreover, we find that the proportion of stereotypical errors compared to anti-stereotypical ones grows with the model size. our findings highlight the potential risks that can arise from increasing model size.
Roberto V. Zicari, Julia Amann, Frédérick Bruneault, Megan Coffee, Boris Düdder, Eleanore Hickman, Alessio Gallucci, Thomas Krendl Gilbert, Thilo Hagendorff, Irmhild Van Halem, Elisabeth Hildt, Sune Holm, Georgios Kararigas, Pedro Kringen, Vince I. Madai, Emilie Wiinblad Mathez, Jesmin Jahan Tithi, Dennis Vetter, Magnus Westerlund, Renee Wurth
Abstract: this report is a methodological reflection on z-inspection$^{\small{\circledr}}$. z-inspection$^{\small{\circledr}}$ is a holistic process used to evaluate the trustworthiness of ai-based technologies at different stages of the ai lifecycle. it focuses, in particular, on the identification and discussion of ethical issues and tensions through the elaboration of socio-technical scenarios. it uses the general european union's high-level expert group's (eu hleg) guidelines for trustworthy ai. this report illustrates for both ai researchers and ai practitioners how the eu hleg guidelines for trustworthy ai can be applied in practice. we share the lessons learned from conducting a series of independent assessments to evaluate the trustworthiness of ai systems in healthcare. we also share key recommendations and practical suggestions on how to ensure a rigorous trustworthy ai assessment throughout the life-cycle of an ai system.
Paul Röttger, Haitham Seelawi, Debora Nozza, Zeerak Talat, Bertie Vidgen
Abstract: hate speech detection models are typically evaluated on held-out test sets. however, this risks painting an incomplete and potentially misleading picture of model performance because of increasingly well-documented systematic gaps and biases in hate speech datasets. to enable more targeted diagnostic insights, recent research has thus introduced functional tests for hate speech detection models. however, these tests currently only exist for english-language content, which means that they cannot support the development of more effective models in other languages spoken by billions across the world. to help address this issue, we introduce multilingual hatecheck (mhc), a suite of functional tests for multilingual hate speech detection models. mhc covers 34 functionalities across ten languages, which is more languages than any other hate speech dataset. to illustrate mhc's utility, we train and test a high-performing multilingual hate speech detection model, and reveal critical model weaknesses for monolingual and cross-lingual applications.

2022-06-19

Sam Clarke, Ben Cottier, Aryeh Englander, Daniel Eth, David Manheim, Samuel Dylan Martin, Issa Rice
Abstract: this report outlines work by the modeling transformative ai risk (mtair) project, an attempt to map out the key hypotheses, uncertainties, and disagreements in debates about catastrophic risks from advanced ai, and the relationships between them. this builds on an earlier diagram by ben cottier and rohin shah which laid out some of the crucial disagreements ("cruxes") visually, with some explanation. based on an extensive literature review and engagement with experts, the report explains a model of the issues involved, and the initial software-based implementation that can incorporate probability estimates or other quantitative factors to enable exploration, planning, and/or decision support. by gathering information from various debates and discussions into a single more coherent presentation, we hope to enable better discussions and debates about the issues involved. the model starts with a discussion of reasoning via analogies and general prior beliefs about artificial intelligence. following this, it lays out a model of different paths and enabling technologies for high-level machine intelligence, and a model of how advances in the capabilities of these systems might proceed, including debates about self-improvement, discontinuous improvements, and the possibility of distributed, non-agentic high-level intelligence or slower improvements. the model also looks specifically at the question of learned optimization, and whether machine learning systems will create mesa-optimizers. the impact of different safety research on the previous sets of questions is then examined, to understand whether and how research could be useful in enabling safer systems. finally, we discuss a model of different failure modes and loss of control or takeover scenarios.
Inioluwa Deborah Raji, I. Elizabeth Kumar, Aaron Horowitz, Andrew D. Selbst
Abstract: deployed ai systems often do not work. they can be constructed haphazardly, deployed indiscriminately, and promoted deceptively. however, despite this reality, scholars, the press, and policymakers pay too little attention to functionality. this leads to technical and policy solutions focused on "ethical" or value-aligned deployments, often skipping over the prior question of whether a given system functions, or provides any benefits at all. to describe the harms of various types of functionality failures, we analyze a set of case studies to create a taxonomy of known ai functionality issues. we then point to policy and organizational responses that are often overlooked and become more readily available once functionality is drawn into focus. we argue that functionality is a meaningful ai policy challenge, operating as a necessary first step towards protecting affected communities from algorithmic harm.

2022-06-16

Joseph Carlsmith
Abstract: this report examines what i see as the core argument for concern about existential risk from misaligned artificial intelligence. i proceed in two stages. first, i lay out a backdrop picture that informs such concern. on this picture, intelligent agency is an extremely powerful force, and creating agents much more intelligent than us is playing with fire -- especially given that if their objectives are problematic, such agents would plausibly have instrumental incentives to seek power over humans. second, i formulate and evaluate a more specific six-premise argument that creating agents of this kind will lead to existential catastrophe by 2070. on this argument, by 2070: (1) it will become possible and financially feasible to build relevantly powerful and agentic ai systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) ai systems than to build misaligned (and relevantly powerful/agentic) ai systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe. i assign rough subjective credences to the premises in this argument, and i end up with an overall estimate of ~5% that an existential catastrophe of this kind will occur by 2070. (may 2022 update: since making this report public in april 2021, my estimate here has gone up, and is now at >10%.)

2022-06-15

Lachlan D. Urquhart, Glenn Mcgarry, Andy Crabtree
Abstract: we consider a series of legal provocations emerging from the proposed european union ai act 2021 (aia) and how they open up new possibilities for hci in the design and development of trustworthy autonomous systems. the aia continues the by design trend seen in recent eu regulation of emerging technologies. the aia targets ai developments that pose risks to society and citizens fundamental rights, introducing mandatory design and development requirements for high-risk ai systems (hrais). these requirements regulate different stages of the ai development cycle including ensuring data quality and governance strategies, mandating testing of systems, ensuring appropriate risk management, designing for human oversight, and creating technical documentation. these requirements open up new opportunities for hci that reach beyond established concerns with the ethics and explainability of ai and situate ai development in human-centered processes and methods of design to enable compliance with regulation and foster societal trust in ai.
Mengyi Wei, Zhixuan Zhou
Abstract: with the powerful performance of artificial intelligence (ai) also comes prevalent ethical issues. though governments and corporations have curated multiple ai ethics guidelines to curb unethical behavior of ai, the effect has been limited, probably due to the vagueness of the guidelines. in this paper, we take a closer look at how ai ethics issues take place in real world, in order to have a more in-depth and nuanced understanding of different ethical issues as well as their social impact. with a content analysis of ai incident database, which is an effort to prevent repeated real world ai failures by cataloging incidents, we identified 13 application areas which often see unethical use of ai, with intelligent service robots, language/vision models and autonomous driving taking the lead. ethical issues appear in 8 different forms, from inappropriate use and racial discrimination, to physical safety and unfair algorithm. with this taxonomy of ai ethics issues, we aim to provide ai practitioners with a practical guideline when trying to deploy ai applications ethically.
Theodore R Sumers, Robert D Hawkins, Mark K Ho, Thomas L Griffiths, Dylan Hadfield-Menell
Abstract: from the earliest years of our lives, humans use language to express our beliefs and desires. being able to talk to artificial agents about our preferences would thus fulfill a central goal of value alignment. yet today, we lack computational models explaining such language use. to address this challenge, we formalize learning from language in a contextual bandit setting and ask how a human might communicate preferences over behaviors. we study two distinct types of language: $\textit{instructions}$, which provide information about the desired policy, and $\textit{descriptions}$, which provide information about the reward function. we show that the agent's degree of autonomy determines which form of language is optimal: instructions are better in low-autonomy settings, but descriptions are better when the agent will need to act independently. we then define a pragmatic listener agent that robustly infers the speaker's reward function by reasoning about $\textit{how}$ the speaker expresses themselves. we validate our models with a behavioral experiment, demonstrating that (1) our speaker model predicts human behavior, and (2) our pragmatic listener successfully recovers humans' reward functions. finally, we show that this form of social learning can integrate with and reduce regret in traditional reinforcement learning. we hope these insights facilitate a shift from developing agents that $\textit{obey}$ language to agents that $\textit{learn}$ from it.

2022-06-12

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike
Abstract: we fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. on a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. our models help find naturally occurring flaws in both model and human written summaries, and intentional flaws in summaries written by humans to be deliberately misleading. we study scaling properties of critiquing with both topic-based summarization and synthetic tasks. larger models write more helpful critiques, and on most tasks, are better at self-critiquing, despite having harder-to-critique outputs. larger models can also integrate their own self-critiques as feedback, refining their own summaries into better ones. finally, we motivate and introduce a framework for comparing critiquing ability to generation and discrimination ability. our measurements suggest that even large models may still have relevant knowledge they cannot or do not articulate as critiques. these results are a proof of concept for using ai-assisted human feedback to scale the supervision of machine learning systems to tasks that are difficult for humans to evaluate directly. we release our training datasets, as well as samples from our critique assistance experiments.
Dan Hendrycks, Mantas Mazeika
Abstract: artificial intelligence (ai) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. current ai research lacks a systematic discussion of how to manage long-tail risks from ai systems, including speculative long-term risks. keeping in mind the potential benefits of ai, there is some concern that building ever more intelligent and powerful ai systems could eventually result in systems that are more powerful than us; some say this is like playing with fire and speculate that this could create existential risks (x-risks). to add precision and ground these discussions, we provide a guide for how to analyze ai x-risk, which consists of three parts: first, we review how systems can be made safer today, drawing on time-tested concepts from hazard analysis and systems safety that have been designed to steer large processes in safer directions. next, we discuss strategies for having long-term impacts on the safety of future systems. finally, we discuss a crucial concept in making ai systems safer by improving the balance between safety and general capabilities. we hope this document and the presented concepts and tools serve as a useful guide for understanding how to analyze ai x-risk.

2022-06-09

Amit Sheth, Manas Gaur, Kaushik Roy, Revathy Venkataraman, Vedant Khandelwal
Abstract: ai systems have been widely adopted across various domains in the real world. however, in high-value, sensitive, or safety-critical applications such as self-management for personalized health or food recommendation with a specific purpose (e.g., allergy-aware recipe recommendations), their adoption is unlikely. firstly, the ai system needs to follow guidelines or well-defined processes set by experts; the data alone will not be adequate. for example, to diagnose the severity of depression, mental healthcare providers use patient health questionnaire (phq-9). so if an ai system were to be used for diagnosis, the medical guideline implied by the phq-9 needs to be used. likewise, a nutritionist's knowledge and steps would need to be used for an ai system that guides a diabetic patient in developing a food plan. second, the blackbox nature typical of many current ai systems will not work; the user of an ai system will need to be able to give user-understandable explanations, explanations constructed using concepts that humans can understand and are familiar with. this is the key to eliciting confidence and trust in the ai system. for such applications, in addition to data and domain knowledge, the ai systems need to have access to and use the process knowledge, an ordered set of steps that the ai system needs to use or adhere to.

2022-06-08

Esma Balkir, Svetlana Kiritchenko, Isar Nejadgholi, Kathleen C. Fraser
Abstract: motivations for methods in explainable artificial intelligence (xai) often include detecting, quantifying and mitigating bias, and contributing to making machine learning models fairer. however, exactly how an xai method can help in combating biases is often left unspecified. in this paper, we briefly review trends in explainability and fairness in nlp research, identify the current practices in which explainability methods are applied to detect and mitigate bias, and investigate the barriers preventing xai methods from being used more widely in tackling fairness issues.

2022-06-06

Charl Maree, Jan Erik Modal, Christian W. Omlin
Abstract: the application of ai in finance is increasingly dependent on the principles of responsible ai. these principles - explainability, fairness, privacy, accountability, transparency and soundness form the basis for trust in future ai systems. in this study, we address the first principle by providing an explanation for a deep neural network that is trained on a mixture of numerical, categorical and textual inputs for financial transaction classification. the explanation is achieved through (1) a feature importance analysis using shapley additive explanations (shap) and (2) a hybrid approach of text clustering and decision tree classifiers. we then test the robustness of the model by exposing it to a targeted evasion attack, leveraging the knowledge we gained about the model through the extracted explanation.
Thao Le, Tim Miller, Ronal Singh, Liz Sonenberg
Abstract: in this paper, we show that counterfactual explanations of confidence scores help users better understand and better trust an ai model's prediction in human-subject studies. showing confidence scores in human-agent interaction systems can help build trust between humans and ai systems. however, most existing research only used the confidence score as a form of communication, and we still lack ways to explain why the algorithm is confident. this paper also presents two methods for understanding model confidence using counterfactual explanation: (1) based on counterfactual examples; and (2) based on visualisation of the counterfactual space.
Jan H. Kirchner, Logan Smith, Jacques Thibodeau, Kyle Mcdonell, Laria Reynolds
Abstract: ai alignment research is the field of study dedicated to ensuring that artificial intelligence (ai) benefits humans. as machine intelligence gets more advanced, this research is becoming increasingly important. researchers in the field share ideas across different media to speed up the exchange of information. however, this focus on speed means that the research landscape is opaque, making it difficult for young researchers to enter the field. in this project, we collected and analyzed existing ai alignment research. we found that the field is growing quickly, with several subfields emerging in parallel. we looked at the subfields and identified the prominent researchers, recurring topics, and different modes of communication in each. furthermore, we found that a classifier trained on ai alignment research articles can detect relevant articles that we did not originally include in the dataset. we are sharing the dataset with the research community and hope to develop tools in the future that will help both established researchers and young researchers get more involved in the field.

2022-06-05

Xin Lian
Abstract: with the covid-19 pandemic continuing, hatred against asians is intensifying in countries outside asia, especially among the chinese. there is an urgent need to detect and prevent hate speech towards asians effectively. in this work, we first create covid-hate-2022, an annotated dataset including 2,025 annotated tweets fetched in early february 2022, which are labeled based on specific criteria, and we present the comprehensive collection of scenarios of hate and non-hate tweets in the dataset. second, we fine-tune the bert model based on the relevant datasets and demonstrate several strategies related to the "cleaning" of the tweets. third, we investigate the performance of advanced fine-tuning strategies with various model-centric and data-centric approaches, and we show that both strategies generally improve the performance, while data-centric ones outperform the others, and it demonstrates the feasibility and effectiveness of the data-centric approaches in the associated tasks.
Daniil Moskovskiy, Daryna Dementieva, Alexander Panchenko
Abstract: detoxification is a task of generating text in polite style while preserving meaning and fluency of the original toxic text. existing detoxification methods are designed to work in one exact language. this work investigates multilingual and cross-lingual detoxification and the behavior of large multilingual models like in this setting. unlike previous works we aim to make large language models able to perform detoxification without direct fine-tuning in given language. experiments show that multilingual models are capable of performing multilingual style transfer. however, models are not able to perform cross-lingual detoxification and direct fine-tuning on exact language is inevitable.

2022-06-01

Yuri Nakao, Lorenzo Strappelli, Simone Stumpf, Aisha Naseer, Daniele Regoli, Giulia Del Gamba
Abstract: with artificial intelligence (ai) to aid or automate decision-making advancing rapidly, a particular concern is its fairness. in order to create reliable, safe and trustworthy systems through human-centred artificial intelligence (hcai) design, recent efforts have produced user interfaces (uis) for ai experts to investigate the fairness of ai models. in this work, we provide a design space exploration that supports not only data scientists but also domain experts to investigate ai fairness. using loan applications as an example, we held a series of workshops with loan officers and data scientists to elicit their requirements. we instantiated these requirements into fairhil, a ui to support human-in-the-loop fairness investigations, and describe how this ui could be generalized to other use cases. we evaluated fairhil through a think-aloud user study. our work contributes better designs to investigate an ai model's fairness-and move closer towards responsible ai.
Sullam Jeoung, Jana Diesner
Abstract: previous work has examined how debiasing language models affect downstream tasks, specifically, how debiasing techniques influence task performance and whether debiased models also make impartial predictions in downstream tasks or not. however, what we don't understand well yet is why debiasing methods have varying impacts on downstream tasks and how debiasing techniques affect internal components of language models, i.e., neurons, layers, and attentions. in this paper, we decompose the internal mechanisms of debiasing language models with respect to gender by applying causal mediation analysis to understand the influence of debiasing methods on toxicity detection as a downstream task. our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics, and to focus on changes in the behavior of certain components of the models, e.g.,first two layers of language models, and attention heads.

2022-05-31

Giannis Daras, Alexandros G. Dimakis
Abstract: we discover that dalle-2 seems to have a hidden vocabulary that can be used to generate images with absurd prompts. for example, it seems that \texttt{apoploe vesrreaitais} means birds and \texttt{contarra ccetnxniams luryca tanniounons} (sometimes) means bugs or pests. we find that these prompts are often consistent in isolation but also sometimes in combinations. we present our black-box method to discover words that seem random but have some correspondence to visual concepts. this creates important security and interpretability challenges.

2022-05-27

Awantee Deshpande, Dana Ruiter, Marius Mosbach, Dietrich Klakow
Abstract: analyzing ethnic or religious bias is important for improving fairness, accountability, and transparency of natural language processing models. however, many techniques rely on human-compiled lists of bias terms, which are expensive to create and are limited in coverage. in this study, we present a fully data-driven pipeline for generating a knowledge graph (kg) of cultural knowledge and stereotypes. our resulting kg covers 5 religious groups and 5 nationalities and can easily be extended to include more entities. our human evaluation shows that the majority (59.2%) of non-singleton entries are coherent and complete stereotypes. we further show that performing intermediate masked language model training on the verbalized kg leads to a higher level of cultural awareness in the model and has the potential to increase classification performance on knowledge-crucial samples on a related task, i.e., hate speech detection.

2022-05-25

Rebecca Qian, Candace Ross, Jude Fernandes, Eric Smith, Douwe Kiela, Adina Williams
Abstract: unwanted and often harmful social biases are becoming ever more salient in nlp research, affecting both models and datasets. in this work, we ask whether training on demographically perturbed data leads to fairer language models. we collect a large dataset of human annotated text perturbations and train a neural perturbation model, which we show outperforms heuristic alternatives. we find that (i) language models (lms) pre-trained on demographically perturbed corpora are typically more fair, and (ii) lms finetuned on perturbed glue datasets exhibit less demographic bias on downstream tasks, and (iii) fairness improvements do not come at the expense of performance on downstream tasks. lastly, we discuss outstanding questions about how best to evaluate the (un)fairness of large language models. we hope that this exploration of neural demographic perturbation will help drive more improvement towards fairer nlp.
Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang
Abstract: are large pre-trained language models leaking your personal information? in this paper, we analyze whether pre-trained language models (plms) are prone to leaking personal information. specifically, we query plms for email addresses with contexts of the email address or prompts containing the owner's name. we find that plms do leak personal information due to memorization. however, since the models are weak at association, the risk of specific personal information being extracted by attackers is low. we hope this work could help the community to better understand the privacy risk of plms and bring new insights to make plms safe.
Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, Jaesung Park, Ximing Lu, Prithviraj Ammanabrolu, Rowan Zellers, Ronan Le Bras, Gunhee Kim, Yejin Choi
Abstract: large language models readily adapt to novel settings, even without task-specific training data. can their zero-shot capacity be extended to multimodal inputs? in this work, we propose esper which extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning. our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, in the image case our reward optimization relies only on cosine similarity derived from clip, and thus requires no additional explicitly paired (image, caption) data. because the parameters of the language model are left unchanged, the model maintains its capacity for zero-shot generalization. experiments demonstrate that esper outperforms baselines and prior work on a variety of zero-shot tasks; these include a new benchmark we collect+release, esp dataset, which tasks models with generating several diversely-styled captions for each image.
Hyunwoo Kim, Youngjae Yu, Liwei Jiang, Ximing Lu, Daniel Khashabi, Gunhee Kim, Yejin Choi, Maarten Sap
Abstract: most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them. to address this issue, we introduce prosocialdialog, the first large-scale multi-turn dialogue dataset to teach conversational agents to respond to problematic content following social norms. covering diverse unethical, problematic, biased, and toxic situations, prosocialdialog contains responses that encourage prosocial behavior, grounded in commonsense social rules (i.e., rules-of-thumb, rots). created via a human-ai collaborative framework, prosocialdialog consists of 58k dialogues, with 331k utterances, 160k unique rots, and 497k dialogue safety labels accompanied by free-form rationales. with this dataset, we introduce a dialogue safety detection module, canary, capable of generating rots given conversational context, and a socially-informed dialogue agent, prost. empirical results show that prost generates more socially acceptable dialogues compared to other state-of-the-art language and dialogue models in both in-domain and out-of-domain settings. additionally, canary effectively guides conversational agents and off-the-shelf language models to generate significantly more prosocial responses. our work highlights the promise and importance of creating and steering conversational ai to be socially responsible.
Sascha Saralajew, Ammar Shaker, Zhao Xu, Kiril Gashteovski, Bhushan Kotnis, Wiem Ben Rim, Jürgen Quittek, Carolin Lawrence
Abstract: with the rise of ai systems in real-world applications comes the need for reliable and trustworthy ai. an essential aspect of this are explainable ai systems. however, there is no agreed standard on how explainable ai systems should be assessed. inspired by the turing test, we introduce a human-centric assessment framework where a leading domain expert accepts or rejects the solutions of an ai system and another domain expert. by comparing the acceptance rates of provided solutions, we can assess how the ai system performs compared to the domain expert, and whether the ai system's explanations (if provided) are human-understandable. this setup -- comparable to the turing test -- can serve as a framework for a wide range of human-centric ai system assessments. we demonstrate this by presenting two instantiations: (1) an assessment that measures the classification accuracy of a system with the option to incorporate label uncertainties; (2) an assessment where the usefulness of provided explanations is determined in a human-centric manner.

2022-05-24

Pablo Mosteiro, Jesse Kuiper, Judith Masthoff, Floortje Scheepers, Marco Spruit
Abstract: fairness and bias are crucial concepts in artificial intelligence, yet they are relatively ignored in machine learning applications in clinical psychiatry. we computed fairness metrics and present bias mitigation strategies using a model trained on clinical mental health data. we collected structured data related to the admission, diagnosis, and treatment of patients in the psychiatry department of the university medical center utrecht. we trained a machine learning model to predict future administrations of benzodiazepines on the basis of past data. we found that gender plays an unexpected role in the predictions-this constitutes bias. using the ai fairness 360 package, we implemented reweighing and discrimination-aware regularization as bias mitigation strategies, and we explored their implications for model performance. this is the first application of bias exploration and mitigation in a machine learning model trained on real clinical psychiatry data.
Yau-Shian Wang, Yingshan Chang
Abstract: due to the subtleness, implicity, and different possible interpretations perceived by different people, detecting undesirable content from text is a nuanced difficulty. it is a long-known risk that language models (lms), once trained on corpus containing undesirable content, have the power to manifest biases and toxicity. however, recent studies imply that, as a remedy, lms are also capable of identifying toxic content without additional fine-tuning. prompt-methods have been shown to effectively harvest this surprising self-diagnosing capability. however, existing prompt-based methods usually specify an instruction to a language model in a discriminative way. in this work, we explore the generative variant of zero-shot prompt-based toxicity detection with comprehensive trials on prompt engineering. we evaluate on three datasets with toxicity labels annotated on social media posts. our analysis highlights the strengths of our generative classification approach both quantitatively and qualitatively. interesting aspects of self-diagnosis and its ethical implications are discussed.

2022-05-23

Tomasz Korbak, Ethan Perez, Christopher L Buckley
Abstract: reinforcement learning (rl) is frequently employed in fine-tuning large language models (lms), such as gpt-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. the rl formulation involves treating the lm as a policy and updating it to maximise the expected value of a reward function which captures human preferences, such as non-offensiveness. in this paper, we analyze challenges associated with treating a language model as an rl policy and show how avoiding those challenges requires moving beyond the rl paradigm. we start by observing that the standard rl approach is flawed as an objective for fine-tuning lms because it leads to distribution collapse: turning the lm into a degenerate distribution. then, we analyze kl-regularised rl, a widely used recipe for fine-tuning lms, which additionally constrains the fine-tuned lm to stay close to its original distribution in terms of kullback-leibler (kl) divergence. we show that kl-regularised rl is equivalent to variational inference: approximating a bayesian posterior which specifies how to update a prior lm to conform with evidence provided by the reward function. we argue that this bayesian inference view of kl-regularised rl is more insightful than the typically employed rl perspective. the bayesian inference view explains how kl-regularised rl avoids the distribution collapse problem and offers a first-principles derivation for its objective. while this objective happens to be equivalent to rl (with a particular choice of parametric reward), there exist other objectives for fine-tuning lms which are no longer equivalent to rl. that observation leads to a more general point: rl is not an adequate formal framework for problems such as fine-tuning language models. these problems are best viewed as bayesian inference: approximating a pre-defined target distribution.
Conrad Borchers, Dalia Sara Gala, Benjamin Gilburt, Eduard Oravkin, Wilfried Bounsi, Yuki M. Asano, Hannah Rose Kirk
Abstract: the growing capability and availability of generative language models has enabled a wide range of new downstream tasks. academic research has identified, quantified and mitigated biases present in language models but is rarely tailored to downstream tasks where wider impact on individuals and society can be felt. in this work, we leverage one popular generative language model, gpt-3, with the goal of writing unbiased and realistic job advertisements. we first assess the bias and realism of zero-shot generated advertisements and compare them to real-world advertisements. we then evaluate prompt-engineering and fine-tuning as debiasing methods. we find that prompt-engineering with diversity-encouraging prompts gives no significant improvement to bias, nor realism. conversely, fine-tuning, especially on unbiased real advertisements, can improve realism and reduce bias.
Afra Feyza Akyürek, Muhammed Yusuf Kocyigit, Sejin Paik, Derry Wijaya
Abstract: researchers have devised numerous ways to quantify social biases vested in pretrained language models. as some language models are capable of generating coherent completions given a set of textual prompts, several prompting datasets have been proposed to measure biases between social groups -- posing language generation as a way of identifying biases. in this opinion paper, we analyze how specific choices of prompt sets, metrics, automatic tools and sampling strategies affect bias results. we find out that the practice of measuring biases through text completion is prone to yielding contradicting results under different experiment settings. we additionally provide recommendations for reporting biases in open-ended language generation for a more complete outlook of biases exhibited by a given language model. code to reproduce the results is released under https://github.com/feyzaakyurek/bias-textgen.
Afra Feyza Akyürek, Sejin Paik, Muhammed Yusuf Kocyigit, Seda Akbiyik, Şerife Leman Runyun, Derry Wijaya
Abstract: large language models trained on a mixture of nlp tasks that are converted into a text-to-text format using prompts, can generalize into novel forms of language and handle novel tasks. a large body of work within prompt engineering attempts to understand the effects of input forms and prompts in achieving superior performance. we consider an alternative measure and inquire whether the way in which an input is encoded affects social biases promoted in outputs. in this paper, we study t0, a large-scale multi-task text-to-text language model trained using prompt-based learning. we consider two different forms of semantically equivalent inputs: question-answer format and premise-hypothesis format. we use an existing bias benchmark for the former bbq and create the first bias benchmark in natural language inference bbnli with hand-written hypotheses while also converting each benchmark into the other form. the results on two benchmarks suggest that given two different formulations of essentially the same input, t0 conspicuously acts more biased in question answering form, which is seen during training, compared to premise-hypothesis form which is unlike its training examples. code and data are released under https://github.com/feyzaakyurek/bbnli.

2022-05-22

Virginia Dignum
Abstract: the impact of artificial intelligence does not depend only on fundamental research and technological developments, but for a large part on how these systems are introduced into society and used in everyday situations. ai is changing the way we work, live and solve challenges but concerns about fairness, transparency or privacy are also growing. ensuring responsible, ethical ai is more than designing systems whose result can be trusted. it is about the way we design them, why we design them, and who is involved in designing them. in order to develop and use ai responsibly, we need to work towards technical, societal, institutional and legal methods and tools which provide concrete support to ai practitioners, as well as awareness and training to enable participation of all, to ensure the alignment of ai systems with our societies' principles and values.

2022-05-20

Samuel Sousa, Roman Kern
Abstract: deep learning (dl) models for natural language processing (nlp) tasks often handle private data, demanding protection against breaches and disclosures. data protection laws, such as the european union's general data protection regulation (gdpr), thereby enforce the need for privacy. although many privacy-preserving nlp methods have been proposed in recent years, no categories to organize them have been introduced yet, making it hard to follow the progress of the literature. to close this gap, this article systematically reviews over sixty dl methods for privacy-preserving nlp published between 2016 and 2020, covering theoretical foundations, privacy-enhancing technologies, and analysis of their suitability for real-world scenarios. first, we introduce a novel taxonomy for classifying the existing methods into three categories: data safeguarding methods, trusted methods, and verification methods. second, we present an extensive summary of privacy threats, datasets for applications, and metrics for privacy evaluation. third, throughout the review, we describe privacy issues in the nlp pipeline in a holistic view. further, we discuss open challenges in privacy-preserving nlp regarding data traceability, computation overhead, dataset size, the prevalence of human biases in embeddings, and the privacy-utility tradeoff. finally, this review presents future research directions to guide successive research and development of privacy-preserving nlp models.

2022-05-19

Samhita Honnavalli, Aesha Parekh, Lily Ou, Sophie Groenwold, Sharon Levy, Vicente Ordonez, William Yang Wang
Abstract: women are often perceived as junior to their male counterparts, even within the same job titles. while there has been significant progress in the evaluation of gender bias in natural language processing (nlp), existing studies seldom investigate how biases toward gender groups change when compounded with other societal biases. in this work, we investigate how seniority impacts the degree of gender bias exhibited in pretrained neural generation models by introducing a novel framework for probing compound bias. we contribute a benchmark robustness-testing dataset spanning two domains, u.s. senatorship and professorship, created using a distant-supervision method. our dataset includes human-written text with underlying ground truth and paired counterfactuals. we then examine gpt-2 perplexity and the frequency of gendered language in generated text. our results show that gpt-2 amplifies bias by considering women as junior and men as senior more often than the ground truth in both domains. these results suggest that nlp applications built using gpt-2 may harm women in professional capacities.

2022-05-15

Allison Lahnala, Charles Welch, Béla Neuendorf, Lucie Flek
Abstract: large pre-trained neural language models have supported the effectiveness of many nlp tasks, yet are still prone to generating toxic language hindering the safety of their use. using empathetic data, we improve over recent work on controllable text generation that aims to reduce the toxicity of generated text. we find we are able to dramatically reduce the size of fine-tuning data to 7.5-30k samples while at the same time making significant improvements over state-of-the-art toxicity mitigation of up to 3.4% absolute reduction (26% relative) from the original work on 2.3m samples, by strategically sampling data based on empathy scores. we observe that the degree of improvement is subject to specific communication components of empathy. in particular, the cognitive components of empathy significantly beat the original dataset in almost all experiments, while emotional empathy was tied to less improvement and even underperforming random samples of the original data. this is a particularly implicative insight for nlp work concerning empathy as until recently the research and resources built for it have exclusively considered empathy as an emotional concept.

2022-05-12

Gaurav Maheshwari, Pascal Denis, Mikaela Keller, Aurélien Bellet
Abstract: encoded text representations often capture sensitive attributes about individuals (e.g., race or gender), which raise privacy concerns and can make downstream models unfair to certain groups. in this work, we propose federate, an approach that combines ideas from differential privacy and adversarial training to learn private text representations which also induces fairer models. we empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on four nlp datasets. our results show that federate consistently improves upon previous methods, and thus suggest that privacy and fairness can positively reinforce each other.
Sarah Alnegheimish, Alicia Guo, Yi Sun
Abstract: evaluation of biases in language models is often limited to synthetically generated datasets. this dependence traces back to the need for a prompt-style dataset to trigger specific behaviors of language models. in this paper, we address this gap by creating a prompt dataset with respect to occupations collected from real-world natural sentences present in wikipedia. we aim to understand the differences between using template-based prompts and natural sentence prompts when studying gender-occupation biases in language models. we find bias evaluations are very sensitive to the design choices of template prompts, and we propose using natural sentence prompts for systematic evaluations to step away from design choices that could introduce bias in the observations.

2022-05-09

Anton Korinek, Avital Balwit
Abstract: as artificial intelligence (ai) becomes more powerful and widespread, the ai alignment problem - how to ensure that ai systems pursue the goals that we want them to pursue - has garnered growing attention. this article distinguishes two types of alignment problems depending on whose goals we consider, and analyzes the different solutions necessitated by each. the direct alignment problem considers whether an ai system accomplishes the goals of the entity operating it. in contrast, the social alignment problem considers the effects of an ai system on larger groups or on society more broadly. in particular, it also considers whether the system imposes externalities on others. whereas solutions to the direct alignment problem center around more robust implementation, social alignment problems typically arise because of conflicts between individual and group-level goals, elevating the importance of ai governance to mediate such conflicts. addressing the social alignment problem requires both enforcing existing norms on their developers and operators and designing new norms that apply directly to ai systems.
Punyajoy Saha, Kanishk Singh, Adarsh Kumar, Binny Mathew, Animesh Mukherjee
Abstract: recently, many studies have tried to create generation models to assist counter speakers by providing counterspeech suggestions for combating the explosive proliferation of online hate. however, since these suggestions are from a vanilla generation model, they might not include the appropriate properties required to counter a particular hate speech instance. in this paper, we propose countergedi - an ensemble of generative discriminators (gedi) to guide the generation of a dialogpt model toward more polite, detoxified, and emotionally laden counterspeech. we generate counterspeech using three datasets and observe significant improvement across different attribute scores. the politeness and detoxification scores increased by around 15% and 6% respectively, while the emotion in the counterspeech increased by at least 10% across all the datasets. we also experiment with triple-attribute control and observe significant improvement over single attribute results when combining complementing attributes, e.g., politeness, joyfulness and detoxification. in all these experiments, the relevancy of the generated text does not deteriorate due to the application of these controls
Mireia Yurrita, Dave Murray-Rust, Agathe Balayn, Alessandro Bozzon
Abstract: in an effort to regulate machine learning-driven (ml) systems, current auditing processes mostly focus on detecting harmful algorithmic biases. while these strategies have proven to be impactful, some values outlined in documents dealing with ethics in ml-driven systems are still underrepresented in auditing processes. such unaddressed values mainly deal with contextual factors that cannot be easily quantified. in this paper, we develop a value-based assessment framework that is not limited to bias auditing and that covers prominent ethical principles for algorithmic systems. our framework presents a circular arrangement of values with two bipolar dimensions that make common motivations and potential tensions explicit. in order to operationalize these high-level principles, values are then broken down into specific criteria and their manifestations. however, some of these value-specific criteria are mutually exclusive and require negotiation. as opposed to some other auditing frameworks that merely rely on ml researchers' and practitioners' input, we argue that it is necessary to include stakeholders that present diverse standpoints to systematically negotiate and consolidate value and criteria tensions. to that end, we map stakeholders with different insight needs, and assign tailored means for communicating value manifestations to them. we, therefore, contribute to current ml auditing practices with an assessment framework that visualizes closeness and tensions between values and we give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.

2022-05-06

Elliott Waissbluth, Hany Farid, Vibhor Sehgal, Ankit Peshin, Sadia Afroz
Abstract: how, in 20 short years, did we go from the promise of the internet to democratize access to knowledge and make the world more understanding and enlightened, to the litany of daily horrors that is today's internet? we are awash in disinformation consisting of lies, conspiracies, and general nonsense, all with real-world implications ranging from horrific humans rights violations to threats to our democracy and global public health. although the internet is vast, the peddlers of disinformation appear to be more localized. to this end, we describe a domain-level analysis for predicting if a domain is complicit in distributing or amplifying disinformation. this process analyzes the underlying domain content and the hyperlinking connectivity between domains to predict if a domain is peddling in disinformation. these basic insights extend to an analysis of disinformation on telegram and twitter. from these insights, we propose that search engines and social-media recommendation algorithms can systematically discover and demote the worst disinformation offenders, returning some trust and sanity to our online communities.

2022-05-05

George J. Cancro, Shimei Pan, James Foulds
Abstract: when a human receives a prediction or recommended course of action from an intelligent agent, what additional information, beyond the prediction or recommendation itself, does the human require from the agent to decide whether to trust or reject the prediction or recommendation? in this paper we survey literature in the area of trust between a single human supervisor and a single agent subordinate to determine the nature and extent of this additional information and to characterize it into a taxonomy that can be leveraged by future researchers and intelligent agent practitioners. by examining this question from a human-centered, information-focused point of view, we can begin to compare and contrast different implementations and also provide insight and directions for future work.

2022-05-04

Prithviraj Ammanabrolu, Liwei Jiang, Maarten Sap, Hannaneh Hajishirzi, Yejin Choi
Abstract: we focus on creating agents that act in alignment with socially beneficial norms and values in interactive narratives or text-based games -- environments wherein an agent perceives and interacts with a world through natural language. such interactive agents are often trained via reinforcement learning to optimize task performance, even when such rewards may lead to agent behaviors that violate societal norms -- causing harm either to the agent itself or other entities in the environment. social value alignment refers to creating agents whose behaviors conform to expected moral and social norms for a given context and group of people -- in our case, it means agents that behave in a manner that is less harmful and more beneficial for themselves and others. we build on the jiminy cricket benchmark (hendrycks et al. 2021), a set of 25 annotated interactive narratives containing thousands of morally salient scenarios covering everything from theft and bodily harm to altruism. we introduce the galad (game-value alignment through action distillation) agent that uses the social commonsense knowledge present in specially trained language models to contextually restrict its action space to only those actions that are aligned with socially beneficial values. an experimental study shows that the galad agent makes decisions efficiently enough to improve state-of-the-art task performance by 4% while reducing the frequency of socially harmful behaviors by 25% compared to strong contemporary value alignment approaches.
Johannes Himmelreich, Désirée Lim
Abstract: this chapter argues for a structural injustice approach to the governance of ai. structural injustice has an analytical and an evaluative component. the analytical component consists of structural explanations that are well-known in the social sciences. the evaluative component is a theory of justice. structural injustice is a powerful conceptual tool that allows researchers and practitioners to identify, articulate, and perhaps even anticipate, ai biases. the chapter begins with an example of racial bias in ai that arises from structural injustice. the chapter then presents the concept of structural injustice as introduced by the philosopher iris marion young. the chapter moreover argues that structural injustice is well suited as an approach to the governance of ai and compares this approach to alternative approaches that start from analyses of harms and benefits or from value statements. the chapter suggests that structural injustice provides methodological and normative foundations for the values and concerns of diversity, equity, and inclusion. the chapter closes with an outlook onto the idea of structure and on responsibility. the idea of a structure is central to justice. an open theoretical research question is to what extent ai is itself part of the structure of society. finally, the practice of responsibility is central to structural injustice. even if they cannot be held responsible for the existence of structural injustice, every individual and every organization has some responsibility to address structural injustice going forward.
Ninareh Mehrabi, Ahmad Beirami, Fred Morstatter, Aram Galstyan
Abstract: warning: this paper contains content that maybe offensive or upsetting. recent research in natural language processing (nlp) has advanced the development of various toxicity detection models with the intention of identifying and mitigating toxic language from existing systems. despite the abundance of research in this area, less attention has been given to adversarial attacks that force the system to generate toxic language and the defense against them. existing work to generate such attacks is either based on human-generated attacks which is costly and not scalable or, in case of automatic attacks, the attack vector does not conform to human-like language, which can be detected using a language model loss. in this work, we propose attacks against conversational agents that are imperceptible, i.e., they fit the conversation in terms of coherency, relevancy, and fluency, while they are effective and scalable, i.e., they can automatically trigger the system into generating toxic language. we then propose a defense mechanism against such attacks which not only mitigates the attack but also attempts to maintain the conversational flow. through automatic and human evaluations, we show that our defense is effective at avoiding toxic language generation even against imperceptible toxicity triggers while the generated language fits the conversation in terms of coherency and relevancy. lastly, we establish the generalizability of such a defense mechanism on language generation models beyond conversational agents.

2022-05-03

Xuandong Zhao, Lei Li, Yu-Xiang Wang
Abstract: large language models are shown to memorize privacy information such as social security numbers in training data. given the sheer scale of the training corpus, it is challenging to screen and filter these privacy data, either manually or automatically. in this paper, we propose confidentially redacted training (crt), a method to train language generation models while protecting the confidential segments. we borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. we implement the method for both lstm and gpt language models. our experimental results show that the models trained by crt obtain almost the same perplexity while preserving strong confidentiality.

2022-05-01

Masahiro Kaneko, Aizhan Imankulova, Danushka Bollegala, Naoaki Okazaki
Abstract: masked language models (mlms) pre-trained by predicting masked tokens on large corpora have been used successfully in natural language processing tasks for a variety of languages. unfortunately, it was reported that mlms also learn discriminative biases regarding attributes such as gender and race. because most studies have focused on mlms in english, the bias of mlms in other languages has rarely been investigated. manual annotation of evaluation data for languages other than english has been challenging due to the cost and difficulty in recruiting annotators. moreover, the existing bias evaluation methods require the stereotypical sentence pairs consisting of the same context with attribute words (e.g. he/she is a nurse). we propose multilingual bias evaluation (mbe) score, to evaluate bias in various languages using only english attribute word lists and parallel corpora between the target language and english without requiring manually annotated data. we evaluated mlms in eight languages using the mbe and confirmed that gender-related biases are encoded in mlms for all those languages. we manually created datasets for gender bias in japanese and russian to evaluate the validity of the mbe. the results show that the bias scores reported by the mbe significantly correlates with that computed from the above manually created datasets and the existing english datasets for gender bias.

2022-04-30

Yoon A Park, Frank Rudzicz
Abstract: existing studies have investigated the tendency of autoregressive language models to generate contexts that exhibit undesired biases and toxicity. various debiasing approaches have been proposed, which are primarily categorized into data-based and decoding-based. in our study, we investigate the ensemble of the two debiasing paradigms, proposing to use toxic corpus as an additional resource to reduce the toxicity. our result shows that toxic corpus can indeed help to reduce the toxicity of the language generation process substantially, complementing the existing debiasing methods.

2022-04-28

Q. Vera Liao, S. Shyam Sundar
Abstract: current literature and public discourse on "trust in ai" are often focused on the principles underlying trustworthy ai, with insufficient attention paid to how people develop trust. given that ai systems differ in their level of trustworthiness, two open questions come to the fore: how should ai trustworthiness be responsibly communicated to ensure appropriate and equitable trust judgments by different users, and how can we protect users from deceptive attempts to earn their trust? we draw from communication theories and literature on trust in technologies to develop a conceptual model called match, which describes how trustworthiness is communicated in ai systems through trustworthiness cues and how those cues are processed by people to make trust judgments. besides ai-generated content, we highlight transparency and interaction as ai systems' affordances that present a wide range of trustworthiness cues to users. by bringing to light the variety of users' cognitive processes to make trust judgments and their potential limitations, we urge technology creators to make conscious decisions in choosing reliable trustworthiness cues for target users and, as an industry, to regulate this space and prevent malicious use. towards these goals, we define the concepts of warranted trustworthiness cues and expensive trustworthiness cues, and propose a checklist of requirements to help technology creators identify appropriate cues to use. we present a hypothetical use case to illustrate how practitioners can use match to design ai systems responsibly, and discuss future directions for research and industry efforts aimed at promoting responsible trust in ai.

2022-04-26

Haoran Li, Yangqiu Song, Lixin Fan
Abstract: social chatbots, also known as chit-chat chatbots, evolve rapidly with large pretrained language models. despite the huge progress, privacy concerns have arisen recently: training data of large language models can be extracted via model inversion attacks. on the other hand, the datasets used for training chatbots contain many private conversations between two individuals. in this work, we further investigate the privacy leakage of the hidden states of chatbots trained by language modeling which has not been well studied yet. we show that speakers' personas can be inferred through a simple neural network with high accuracy. to this end, we propose effective defense objectives to protect persona leakage from hidden states. we conduct extensive experiments to demonstrate that our proposed defense objectives can greatly reduce the attack accuracy from 37.6% to 0.5%. meanwhile, the proposed objectives preserve language models' powerful generation ability.

2022-04-25

Vivian Lai, Samuel Carton, Rajat Bhatnagar, Q. Vera Liao, Yunfeng Zhang, Chenhao Tan
Abstract: despite impressive performance in many benchmark datasets, ai models can still make mistakes, especially among out-of-distribution examples. it remains an open question how such imperfect models can be used effectively in collaboration with humans. prior work has focused on ai assistance that helps people make individual high-stakes decisions, which is not scalable for a large amount of relatively low-stakes decisions, e.g., moderating social media comments. instead, we propose conditional delegation as an alternative paradigm for human-ai collaboration where humans create rules to indicate trustworthy regions of a model. using content moderation as a testbed, we develop novel interfaces to assist humans in creating conditional delegation rules and conduct a randomized experiment with two datasets to simulate in-distribution and out-of-distribution scenarios. our study demonstrates the promise of conditional delegation in improving model performance and provides insights into design for this novel paradigm, including the effect of ai explanations.

2022-04-21

Sebastian Farquhar, Ryan Carey, Tom Everitt
Abstract: we present a general framework for training safe agents whose naive incentives are unsafe. as an example, manipulative or deceptive behaviour can improve rewards but should be avoided. most approaches fail here: agents maximize expected return by any means necessary. we formally describe settings with 'delicate' parts of the state which should not be used as a means to an end. we then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state, using causal influence diagram analysis. the resulting agents have no incentive to control the delicate state. we further show how our framework unifies and generalizes existing proposals.

2022-04-20

Richard Plant, Valerio Giuffrida, Dimitra Gkatzia
Abstract: large scale adoption of large language models has introduced a new era of convenient knowledge transfer for a slew of natural language processing tasks. however, these models also run the risk of undermining user trust by exposing unwanted information about the data subjects, which may be extracted by a malicious party, e.g. through adversarial attacks. we present an empirical investigation into the extent of the personal information encoded into pre-trained representations by a range of popular models, and we show a positive correlation between the complexity of a model, the amount of data used in pre-training, and data leakage. in this paper, we present the first wide coverage evaluation and comparison of some of the most popular privacy-preserving algorithms, on a large, multi-lingual dataset on sentiment analysis annotated with demographic information (location, age and gender). the results show since larger and more complex models are more prone to leaking private information, use of privacy-preserving methods is highly desirable. we also find that highly privacy-preserving technologies like differential privacy (dp) can have serious model utility effects, which can be ameliorated using hybrid or metric-dp techniques.
Samson Tan, Araz Taeihagh, Kathy Baxter
Abstract: the speed and scale at which machine learning (ml) systems are deployed are accelerating even as an increasing number of studies highlight their potential for negative impact. there is a clear need for companies and regulators to manage the risk from proposed ml systems before they harm people. to achieve this, private and public sector actors first need to identify the risks posed by a proposed ml system. a system's overall risk is influenced by its direct and indirect effects. however, existing frameworks for ml risk/impact assessment often address an abstract notion of risk or do not concretize this dependence. we propose to address this gap with a context-sensitive framework for identifying ml system risks comprising two components: a taxonomy of the first- and second-order risks posed by ml systems, and their contributing factors. first-order risks stem from aspects of the ml system, while second-order risks stem from the consequences of first-order risks. these consequences are system failures that result from design and development choices. we explore how different risks may manifest in various types of ml systems, the factors that affect each risk, and how first-order risks may lead to second-order effects when the system interacts with the real world. throughout the paper, we show how real events and prior research fit into our machine learning system risk framework (mlsr). mlsr operates on ml systems rather than technologies or domains, recognizing that a system's design, implementation, and use case all contribute to its risk. in doing so, it unifies the risks that are commonly discussed in the ethical ai community (e.g., ethical/human rights risks) with system-level risks (e.g., application, design, control risks), paving the way for holistic risk assessments of ml systems.

2022-04-15

Weiyan Shi, Ryan Shea, Si Chen, Chiyuan Zhang, Ruoxi Jia, Zhou Yu
Abstract: protecting large language models from privacy leakage is becoming increasingly crucial with their wide adoption in real-world products. yet applying differential privacy (dp), a canonical notion with provable privacy guarantees for machine learning models, to those models remains challenging due to the trade-off between model utility and privacy loss. utilizing the fact that sensitive information in language data tends to be sparse, shi et al. (2021) formalized a dp notion extension called selective differential privacy (sdp) to protect only the sensitive tokens defined by a policy function. however, their algorithm only works for rnn-based models. in this paper, we develop a novel framework, just fine-tune twice (jft), that achieves sdp for state-of-the-art large transformer-based models. our method is easy to implement: it first fine-tunes the model with redacted in-domain data, and then fine-tunes it again with the original in-domain data using a private training mechanism. furthermore, we study the scenario of imperfect implementation of policy functions that misses sensitive tokens and develop systematic methods to handle it. experiments show that our method achieves strong utility compared to previous baselines. we also analyze the sdp privacy guarantee empirically with the canary insertion attack.

2022-04-14

Hadas Orgad, Seraphina Goldfarb-Tarrant, Yonatan Belinkov
Abstract: common studies of gender bias in nlp focus either on extrinsic bias measured by model performance on a downstream task or on intrinsic bias found in models' internal representations. however, the relationship between extrinsic and intrinsic bias is relatively unknown. in this work, we illuminate this relationship by measuring both quantities together: we debias a model during downstream fine-tuning, which reduces extrinsic bias, and measure the effect on intrinsic bias, which is operationalized as bias extractability with information-theoretic probing. through experiments on two tasks and multiple bias metrics, we show that our intrinsic bias metric is a better indicator of debiasing than (a contextual adaptation of) the standard weat metric, and can also expose cases of superficial debiasing. our framework provides a comprehensive perspective on bias in nlp models, which can be applied to deploy nlp systems in a more informed manner. our code and model checkpoints are publicly available.
Apoorv Garg, Deval Srivastava, Zhiyang Xu, Lifu Huang
Abstract: due to the superior performance, large-scale pre-trained language models (plms) have been widely adopted in many aspects of human society. however, we still lack effective tools to understand the potential bias embedded in the black-box models. recent advances in prompt tuning show the possibility to explore the internal mechanism of the plms. in this work, we propose two token-level sentiment tests: sentiment association test (sat) and sentiment shift test (sst) which utilize the prompt as a probe to detect the latent bias in the plms. our experiments on the collection of sentiment datasets show that both sat and sst can identify sentiment bias in plms and sst is able to quantify the bias. the results also suggest that fine-tuning can possibly augment the existing bias in plms.

2022-04-13

Souvic Chakraborty, Parag Dutta, Sumegh Roychowdhury, Animesh Mukherjee
Abstract: the last decade has witnessed a surge in the interaction of people through social networking platforms. while there are several positive aspects of these social platforms, the proliferation has led them to become the breeding ground for cyber-bullying and hate speech. recent advances in nlp have often been used to mitigate the spread of such hateful content. since the task of hate speech detection is usually applicable in the context of social networks, we introduce crush, a framework for hate speech detection using user-anchored self-supervision and contextual regularization. our proposed approach secures ~ 1-12% improvement in test set metrics over best performing previous approaches on two types of tasks and multiple popular english social media datasets.

2022-04-12

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam Mccandlish, Chris Olah, Ben Mann, Jared Kaplan
Abstract: we apply preference modeling and reinforcement learning from human feedback (rlhf) to finetune language models to act as helpful and harmless assistants. we find this alignment training improves performance on almost all nlp evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. we explore an iterated online mode of training, where preference models and rl policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. finally, we investigate the robustness of rlhf training, and identify a roughly linear relation between the rl reward and the square root of the kl divergence between the policy and its initialization. alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of ood detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.

2022-04-08

Carolin Holtermann, Anne Lauscher, Simone Paolo Ponzetto
Abstract: although much work in nlp has focused on measuring and mitigating stereotypical bias in semantic spaces, research addressing bias in computational argumentation is still in its infancy. in this paper, we address this research gap and conduct a thorough investigation of bias in argumentative language models. to this end, we introduce abba, a novel resource for bias measurement specifically tailored to argumentation. we employ our resource to assess the effect of argumentative fine-tuning and debiasing on the intrinsic bias found in transformer-based language models using a lightweight adapter-based approach that is more sustainable and parameter-efficient than full fine-tuning. finally, we analyze the potential impact of language model debiasing on the performance in argument quality prediction, a downstream task of computational argumentation. our results show that we are able to successfully and sustainably remove bias in general and argumentative language models while preserving (and sometimes improving) model performance in downstream tasks. we make all experimental code and data available at https://github.com/umanlp/fairargumentativelm.
Pedro Henrique Luz De Araujo, Benjamin Roth
Abstract: behavioural testing -- verifying system capabilities by validating human-designed input-output pairs -- is an alternative evaluation method of natural language processing systems proposed to address the shortcomings of the standard approach: computing metrics on held-out data. while behavioural tests capture human prior knowledge and insights, there has been little exploration on how to leverage them for model training and development. with this in mind, we explore behaviour-aware learning by examining several fine-tuning schemes using hatecheck, a suite of functional tests for hate speech detection systems. to address potential pitfalls of training on data originally intended for evaluation, we train and evaluate models on different configurations of hatecheck by holding out categories of test cases, which enables us to estimate performance on potentially overlooked system properties. the fine-tuning procedure led to improvements in the classification accuracy of held-out functionalities and identity groups, suggesting that models can potentially generalise to overlooked functionalities. however, performance on held-out functionality classes and i.i.d. hate speech detection data decreased, which indicates that generalisation occurs mostly across functionalities from the same class and that the procedure led to overfitting to the hatecheck data distribution.

2022-04-07

Siddhartha Datta, Konrad Kollnig, Nigel Shadbolt
Abstract: digital harms can manifest across any interface. key problems in addressing these harms include the high individuality of harms and the fast-changing nature of digital systems. as a result, we still lack a systematic approach to study harms and produce interventions for end-users. we put forward greasevision, a new framework that enables end-users to collaboratively develop interventions against harms in software using a no-code approach and recent advances in few-shot machine learning. the contribution of the framework and tool allow individual end-users to study their usage history and create personalized interventions. our contribution also enables researchers to study the distribution of harms and interventions at scale.

2022-04-06

Ehsan Aghaei, Xi Niu, Waseem Shadid, Ehab Al-Shaer
Abstract: natural language processing (nlp) has recently gained wide attention in cybersecurity, particularly in cyber threat intelligence (cti) and cyber automation. increased connection and automation have revolutionized the world's economic and cultural infrastructures, while they have introduced risks in terms of cyber attacks. cti is information that helps cybersecurity analysts make intelligent security decisions, that is often delivered in the form of natural language text, which must be transformed to machine readable format through an automated procedure before it can be used for automated security measures. this paper proposes securebert, a cybersecurity language model capable of capturing text connotations in cybersecurity text (e.g., cti) and therefore successful in automation for many critical cybersecurity tasks that would otherwise rely on human expertise and time-consuming manual efforts. securebert has been trained using a large corpus of cybersecurity text.to make securebert effective not just in retaining general english understanding, but also when applied to text with cybersecurity implications, we developed a customized tokenizer as well as a method to alter pre-trained weights. the securebert is evaluated using the standard masked language model (mlm) test as well as two additional standard nlp tasks. our evaluation studies show that securebert\footnote{\url{https://github.com/ehsanaghaei/securebert}} outperforms existing similar models, confirming its capability for solving crucial nlp tasks in cybersecurity.
Caleb Ziems, Jane A. Yu, Yi-Chia Wang, Alon Halevy, Diyi Yang
Abstract: conversational agents have come increasingly closer to human competence in open-domain dialogue settings; however, such models can reflect insensitive, hurtful, or entirely incoherent viewpoints that erode a user's trust in the moral integrity of the system. moral deviations are difficult to mitigate because moral judgments are not universal, and there may be multiple competing judgments that apply to a situation simultaneously. in this work, we introduce a new resource, not to authoritatively resolve moral ambiguities, but instead to facilitate systematic understanding of the intuitions, values and moral judgments reflected in the utterances of dialogue systems. the moral integrity corpus, mic, is such a resource, which captures the moral assumptions of 38k prompt-reply pairs, using 99k distinct rules of thumb (rots). each rot reflects a particular moral conviction that can explain why a chatbot's reply may appear acceptable or problematic. we further organize rots with a set of 9 moral and social attributes and benchmark performance for attribute classification. most importantly, we show that current neural language models can automatically generate new rots that reasonably describe previously unseen interactions, but they still struggle with certain scenarios. our findings suggest that mic will be a useful resource for understanding and language models' implicit moral assumptions and flexibly benchmarking the integrity of conversational agents. to download the data, see https://github.com/gt-salt/mic

2022-04-05

Isar Nejadgholi, Kathleen C. Fraser, Svetlana Kiritchenko
Abstract: robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. new kinds of abusive language continually emerge in online discussions in response to current events (e.g., covid-19), and the deployed abuse detection systems should be updated regularly to remain accurate. in this paper, we show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse. next, we propose an interpretability technique, based on the testing concept activation vector (tcav) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and use that to explain the generalizability of the model on new data, in this case, covid-related anti-asian hate speech. extending this technique, we introduce a novel metric, degree of explicitness, for a single instance and show that the new metric is beneficial in suggesting out-of-domain unlabeled examples to effectively enrich the training data with informative, implicitly abusive texts.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel
Abstract: large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. to further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, transformer language model, which we call pathways language model palm. we trained palm on 6144 tpu v4 chips using pathways, a new ml system which enables highly efficient training across multiple tpu pods. we demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. on a number of these tasks, palm 540b achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released big-bench benchmark. a significant number of big-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. palm also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. we additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

2022-04-03

Mahima Pushkarna, Andrew Zaldivar, Oddur Kjartansson
Abstract: as research and industry moves towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. a clear and thorough understanding of a dataset's origins, development, intent, ethical considerations and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. however, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. it requires consistency and comparability across the documentation of all datasets involved, and as such documentation must be treated as a user-centric product in and of itself. in this paper, we propose data cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. data cards are structured summaries of essential facts about various aspects of ml datasets needed by stakeholders across a dataset's lifecycle for responsible ai development. these summaries provide explanations of processes and rationales that shape the data and consequently the models, such as upstream sources, data collection and annotation methods; training and evaluation methods, intended use; or decisions affecting model performance. we also present frameworks that ground data cards in real-world utility and human-centricity. using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. finally, we present lessons learned from deploying over 20 data cards.

2022-03-31

Fei Mi, Yitong Li, Yulong Zeng, Jingyan Zhou, Yasheng Wang, Chuanfei Xu, Lifeng Shang, Xin Jiang, Shiqi Zhao, Qun Liu
Abstract: in this paper, we introduce pangu-bot, a chinese pre-trained open-domain dialogue generation model based on a large pre-trained language model (plm) pangu-alpha (zeng et al.,2021). different from other pre-trained dialogue models trained over a massive amount of dialogue data from scratch, we aim to build a powerful dialogue model with relatively fewer data and computation costs by inheriting valuable language capabilities and knowledge from plms. to this end, we train pangu-bot from the large plm pangu-alpha, which has been proven well-performed on a variety of chinese natural language tasks. we investigate different aspects of responses generated by pangu-bot, including response quality, knowledge, and safety. we show that pangu-bot outperforms state-of-the-art chinese dialogue systems (cdialgpt (wang et al., 2020), eva (zhou et al., 2021), eva2.0 (gu et al., 2022)) w.r.t. the above three aspects. we also demonstrate that pangu-bot can be easily deployed to generate emotional responses without further training. throughout our empirical analysis, we also point out that the pangu-bot response quality, knowledge correctness, and safety are still far from perfect, and further explorations are indispensable to building reliable and smart dialogue systems. our model and code will be available at https://github.com/huawei-noah/pretrained-language-model/tree/master/pangu-bot soon.

2022-03-29

Yusuke Hirota, Yuta Nakashima, Noa Garcia
Abstract: we study societal bias amplification in image captioning. image captioning models have been shown to perpetuate gender and racial biases, however, metrics to measure, quantify, and evaluate the societal bias in captions are not yet standardized. we provide a comprehensive study on the strengths and limitations of each metric, and propose lic, a metric to study captioning bias amplification. we argue that, for image captioning, it is not enough to focus on the correct prediction of the protected attribute, and the whole context should be taken into account. we conduct extensive evaluation on traditional and state-of-the-art image captioning models, and surprisingly find that, by only focusing on the protected attribute prediction, bias mitigation models are unexpectedly amplifying bias.

2022-03-25

Yang Trista Cao, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta, Varun Kumar, Jwala Dhamala, Aram Galstyan
Abstract: multiple metrics have been introduced to measure fairness in various natural language processing tasks. these metrics can be roughly categorized into two categories: 1) \emph{extrinsic metrics} for evaluating fairness in downstream applications and 2) \emph{intrinsic metrics} for estimating fairness in upstream contextualized language representation models. in this paper, we conduct an extensive correlation study between intrinsic and extrinsic metrics across bias notions using 19 contextualized language models. we find that intrinsic and extrinsic metrics do not necessarily correlate in their original setting, even when correcting for metric misalignments, noise in evaluation datasets, and confounding factors such as experiment configuration for extrinsic metrics. %al

2022-03-23

Boxi Cao, Hongyu Lin, Xianpei Han, Fangchao Liu, Le Sun
Abstract: prompt-based probing has been widely used in evaluating the abilities of pretrained language models (plms). unfortunately, recent studies have discovered such an evaluation may be inaccurate, inconsistent and unreliable. furthermore, the lack of understanding its inner workings, combined with its wide applicability, has the potential to lead to unforeseen risks for evaluating and applying plms in real-world applications. to discover, understand and quantify the risks, this paper investigates the prompt-based probing from a causal view, highlights three critical biases which could induce biased results and conclusions, and proposes to conduct debiasing via causal intervention. this paper provides valuable insights for the design of unbiased datasets, better probing frameworks and more reliable evaluations of pretrained language models. furthermore, our conclusions also echo that we need to rethink the criteria for identifying better pretrained language models. we openly released the source code and data at https://github.com/c-box/causaleval.
Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, Aram Galstyan
Abstract: language models excel at generating coherent text, and model compression techniques such as knowledge distillation have enabled their use in resource-constrained settings. however, these models can be biased in multiple ways, including the unfounded association of male and female genders with gender-neutral professions. therefore, knowledge distillation without any fairness constraints may preserve or exaggerate the teacher model's biases onto the distilled model. to this end, we present a novel approach to mitigate gender disparity in text generation by learning a fair model during knowledge distillation. we propose two modifications to the base knowledge distillation based on counterfactual role reversal$\unicode{x2014}$modifying teacher probabilities and augmenting the training set. we evaluate gender polarity across professions in open-ended text generated from the resulting distilled and finetuned gpt$\unicode{x2012}$2 models and demonstrate a substantial reduction in gender disparity with only a minor compromise in utility. finally, we observe that language models that reduce gender polarity in language generation do not improve embedding fairness or downstream classification fairness.

2022-03-22

Hugo Berg, Siobhan Mackenzie Hall, Yash Bhalgat, Wonsuk Yang, Hannah Rose Kirk, Aleksandar Shtedritski, Max Bain
Abstract: vision-language models can encode societal biases and stereotypes, but there are challenges to measuring and mitigating these multimodal harms due to lacking measurement robustness and feature degradation. to address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. we then investigate debiasing methods and show that prepending learned embeddings to text queries that are jointly trained with adversarial debiasing and a contrastive loss reduces various bias measures with minimal degradation to the image-text representation.

2022-03-21

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, Nat Mcaleese
Abstract: recent large language models often answer factual questions correctly. but users can't trust any given claim a model makes without fact-checking, because language models can hallucinate convincing nonsense. in this work we use reinforcement learning from human preferences (rlhp) to train "open-book" qa models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness. supporting evidence is drawn from multiple documents found via a search engine, or from a single user-provided document. our 280 billion parameter model, gophercite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure. we measure the performance of gophercite by conducting human evaluation of answers to questions in a subset of the naturalquestions and eli5 datasets. the model's response is found to be high-quality 80\% of the time on this natural questions subset, and 67\% of the time on the eli5 subset. abstaining from the third of questions for which it is most unsure improves performance to 90\% and 80\% respectively, approaching human baselines. however, analysis on the adversarial truthfulqa dataset shows why citation is only one part of an overall strategy for safety and trustworthiness: not all claims supported by evidence are true.
Taylor Sorensen, Joshua Robinson, Christopher Michael Rytting, Alexander Glenn Shaw, Kyle Jeffrey Rogers, Alexia Pauline Delorey, Mahmoud Khalil, Nancy Fulda, David Wingate
Abstract: pre-trained language models derive substantial linguistic and factual knowledge from the massive corpora on which they are trained, and prompt engineering seeks to align these models to specific tasks. unfortunately, existing prompt engineering methods require significant amounts of labeled data, access to model parameters, or both. we introduce a new method for selecting prompt templates \textit{without labeled examples} and \textit{without direct access to the model}. specifically, over a set of candidate templates, we choose the template that maximizes the mutual information between the input and the corresponding model output. across 8 datasets representing 7 distinct nlp tasks, we show that when a template has high mutual information, it also has high accuracy on the task. on the largest model, selecting prompts with our method gets 90\% of the way from the average prompt accuracy to the best prompt accuracy and requires no ground truth labels.

2022-03-20

Eve Fleisig, Christiane Fellbaum
Abstract: machine translation and other nlp systems often contain significant biases regarding sensitive attributes, such as gender or race, that worsen system performance and perpetuate harmful stereotypes. recent preliminary research suggests that adversarial learning can be used as part of a model-agnostic bias mitigation method that requires no data modifications. however, adapting this strategy for machine translation and other modern nlp domains requires (1) restructuring training objectives in the context of fine-tuning pretrained large language models and (2) developing measures for gender or other protected variables for tasks in which these attributes must be deduced from the data itself. we present an adversarial learning framework that addresses these challenges to mitigate gender bias in seq2seq machine translation. our framework improves the disparity in translation quality for sentences with male vs. female entities by 86% for english-german translation and 91% for english-french translation, with minimal effect on translation quality. the results suggest that adversarial learning is a promising technique for mitigating gender bias in machine translation.
Trisha Chakraborty, Shaswata Mitra, Sudip Mittal, Maxwell Young
Abstract: proof of work (pow) based cyberdefense systems require incoming network requests to expend effort solving an arbitrary mathematical puzzle. current state of the art is unable to differentiate between trustworthy and untrustworthy connections, requiring all to solve complex puzzles. in this paper, we introduce an artificial intelligence (ai)-assisted pow framework that utilizes ip traffic based features to inform an adaptive issuer which can then generate puzzles with varying hardness. the modular framework uses these capabilities to ensure that untrustworthy clients solve harder puzzles thereby incurring longer latency than authentic requests to receive a response from the server. our preliminary findings reveal our approach effectively throttles untrustworthy traffic.

2022-03-18

Katharina Hämmerl, Björn Deiseroth, Patrick Schramowski, Jindřich Libovický, Alexander Fraser, Kristian Kersting
Abstract: massively multilingual sentence representations are trained on large corpora of uncurated data, with a very imbalanced proportion of languages included in the training. this may cause the models to grasp cultural values including moral judgments from the high-resource languages and impose them on the low-resource languages. the lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs. both these issues can negatively influence zero-shot cross-lingual model transfer and potentially lead to harmful outcomes. therefore, we aim to (1) detect and quantify these issues by comparing different models in different languages, (2) develop methods for improving undesirable properties of the models. our initial experiments using the multilingual model xlm-r show that indeed multilingual lms capture moral norms, even with potentially higher human-agreement than monolingual ones. however, it is not yet clear to what extent these moral norms differ between languages.

2022-03-17

Giuseppe Attanasio, Debora Nozza, Dirk Hovy, Elena Baralis
Abstract: natural language processing (nlp) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. e.g., neural hate speech detection models are strongly influenced by identity terms like gay, or women, resulting in false positives, severe unintended bias, and lower performance. most mitigation techniques use lists of identity terms or samples from the target domain during training. however, this approach requires a-priori knowledge and introduces further bias if important terms are neglected. instead, we propose a knowledge-free entropy-based attention regularization (ear) to discourage overfitting to training-specific terms. an additional objective function penalizes tokens with low self-attention entropy. we fine-tune bert via ear: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in english and italian. ear also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar
Abstract: toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. to help mitigate these issues, we create toxigen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. we develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained language model. controlling machine generation in this way allows toxigen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. we conduct a human evaluation on a challenging subset of toxigen and find that annotators struggle to distinguish machine-generated text from human-written language. we also find that 94.5% of toxic examples are labeled as hate speech by human annotators. using three publicly-available datasets, we show that finetuning a toxicity classifier on our data improves its performance on human-written data substantially. we also demonstrate that toxigen can be used to fight machine-generated toxicity as finetuning improves the classifier significantly on our evaluation subset. our code and data can be found at https://github.com/microsoft/toxigen.

2022-03-15

Rebecca L Johnson, Giada Pistilli, Natalia Menédez-González, Leslye Denisse Dias Duran, Enrico Panai, Julija Kalpokiene, Donald Jay Bertulfo
Abstract: the alignment problem in the context of large language models must consider the plurality of human values in our world. whilst there are many resonant and overlapping values amongst the world's cultures, there are also many conflicting, yet equally valid, values. it is important to observe which cultural values a model exhibits, particularly when there is a value conflict between input prompts and generated outputs. we discuss how the co-creation of language and cultural value impacts large language models (llms). we explore the constitution of the training data for gpt-3 and compare that to the world's language and internet access demographics, as well as to reported statistical profiles of dominant values in some nation-states. we stress tested gpt-3 with a range of value-rich texts representing several languages and nations; including some with values orthogonal to dominant us public opinion as reported by the world values survey. we observed when values embedded in the input text were mutated in the generated outputs and noted when these conflicting values were more aligned with reported dominant us values. our discussion of these results uses a moral value pluralism (mvp) lens to better understand these value mutations. finally, we provide recommendations for how our work may contribute to other current work in the field.

2022-03-11

Xudong Han, Timothy Baldwin, Trevor Cohn
Abstract: adversarial training is a common approach for bias mitigation in natural language processing. although most work on debiasing is motivated by equal opportunity, it is not explicitly captured in standard adversarial training. in this paper, we propose an augmented discriminator for adversarial training, which takes the target class as input to create richer features and more explicitly model equal opportunity. experimental results over two datasets show that our method substantially improves over standard adversarial debiasing methods, in terms of the performance--fairness trade-off.

2022-03-09

Masashi Takeshita, Rafal Rzepka, Kenji Araki
Abstract: various existing studies have analyzed what social biases are inherited by nlp models. these biases may directly or indirectly harm people, therefore previous studies have focused only on human attributes. however, until recently no research on social biases in nlp regarding nonhumans existed. in this paper, we analyze biases to nonhuman animals, i.e. speciesist bias, inherent in english masked language models such as bert. we analyzed speciesist bias against 46 animal names using template-based and corpus-extracted sentences containing speciesist (or non-speciesist) language. we found that pre-trained masked language models tend to associate harmful words with nonhuman animals and have a bias toward using speciesist language for some nonhuman animal names. our code for reproducing the experiments will be made available on github.

2022-03-08

Shalini Saini, Nitesh Saxena
Abstract: medical artificial intelligence (medai) for diagnosis, treatment options, and drug development represents the new age of healthcare. the security, integrity, and credibility of medai tools are paramount issues because human lives are at stake. medai solutions are often heavily dependent on scientific medical research literature as a primary data source that draws the attacker's attention as a potential target. we present a first study of how the output of medai can be polluted with predatory publications presence (ppp). we study two medai systems: medikanren (disease independent) and cancermine (disease-specific), which use research literature as primary data input from the research repository pubmed, pubmed derived database semmeddb, and nih translational knowledge graphs (kgs). our study has a three-pronged focus: (1) identifying the ppp in pubmed; (2) verifying the ppp in semmeddb and the kgs; (3) demonstrating the existing vulnerability of ppp traversing to the medai output. our contribution lies in identifying the existing ppp in the medai inputs and demonstrating how predatory science can jeopardize the credibility of medai solutions, making their real-life deployment questionable.

2022-03-06

Erick Galinkin
Abstract: legislation and public sentiment throughout the world have promoted fairness metrics, explainability, and interpretability as prescriptions for the responsible development of ethical artificial intelligence systems. despite the importance of these three pillars in the foundation of the field, they can be challenging to operationalize and attempts to solve the problems in production environments often feel sisyphean. this difficulty stems from a number of factors: fairness metrics are computationally difficult to incorporate into training and rarely alleviate all of the harms perpetrated by these systems. interpretability and explainability can be gamed to appear fair, may inadvertently reduce the privacy of personal information contained in training data, and increase user confidence in predictions -- even when the explanations are wrong. in this work, we propose a framework for responsibly developing artificial intelligence systems by incorporating lessons from the field of information security and the secure development lifecycle to overcome challenges associated with protecting users in adversarial settings. in particular, we propose leveraging the concepts of threat modeling, design review, penetration testing, and incident response in the context of developing ai systems as ways to resolve shortcomings in the aforementioned methods.
Canwen Xu, Zexue He, Zhankui He, Julian Mcauley
Abstract: language models (lms) can reproduce (or amplify) toxic language seen during training, which poses a risk to their practical application. in this paper, we conduct extensive experiments to study this phenomenon. we analyze the impact of prompts, decoding strategies and training corpora on the output toxicity. based on our findings, we propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator. compared to a supervised baseline, our proposed method shows better toxicity reduction with good generation quality in the generated content under multiple settings. warning: some examples shown in the paper may contain uncensored offensive content.
Rajas Bansal
Abstract: as nlp models become more integrated with the everyday lives of people, it becomes important to examine the social effect that the usage of these systems has. while these models understand language and have increased accuracy on difficult downstream tasks, there is evidence that these models amplify gender, racial and cultural stereotypes and lead to a vicious cycle in many settings. in this survey, we analyze the origins of biases, the definitions of fairness, and how different subfields of nlp mitigate bias. we finally discuss how future studies can work towards eradicating pernicious biases from nlp algorithms.

2022-03-04

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
Abstract: making language models bigger does not inherently make them better at following a user's intent. for example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. in other words, these models are not aligned with their users. in this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. starting with a set of labeler-written prompts and prompts submitted through the openai api, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune gpt-3 using supervised learning. we then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. we call the resulting models instructgpt. in human evaluations on our prompt distribution, outputs from the 1.3b parameter instructgpt model are preferred to outputs from the 175b gpt-3, despite having 100x fewer parameters. moreover, instructgpt models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public nlp datasets. even though instructgpt still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

2022-03-02

Jacob Metcalf, Emanuel Moss, Ranjit Singh, Emnet Tafese, Elizabeth Anne Watkins
Abstract: central to a number of scholarly, regulatory, and public conversations about algorithmic accountability is the question of who should have access to documentation that reveals the inner workings, intended function, and anticipated consequences of algorithmic systems, potentially establishing new routes for impacted publics to contest the operations of these systems. currently, developers largely have a monopoly on information about how their systems actually work and are incentivized to maintain their own ignorance about aspects of how their systems affect the world. increasingly, legislators, regulators and advocates have turned to assessment documentation in order to address the gap between the public's experience of algorithmic harms and the obligations of developers to document and justify their design decisions. however, issues of standing and expertise currently prevent publics from cohering around shared interests in preventing and redressing algorithmic harms; as we demonstrate with multiple cases, courts often find computational harms non-cognizable and rarely require developers to address material claims of harm. constructed with a triadic accountability relationship, algorithmic impact assessment regimes could alter this situation by establishing procedural rights around public access to reporting and documentation. developing a relational approach to accountability, we argue that robust accountability regimes must establish opportunities for publics to cohere around shared experiences and interests, and to contest the outcomes of algorithmic systems that affect their lives. furthermore, algorithmic accountability policies currently under consideration in many jurisdictions must provide the public with adequate standing and opportunities to access and contest the documentation provided by the actors and the judgments passed by the forum.

2022-03-01

Benedikt Lorch, Nicole Scheler, Christian Riess
Abstract: in many applications of forensic image analysis, state-of-the-art results are nowadays achieved with machine learning methods. however, concerns about their reliability and opaqueness raise the question whether such methods can be used in criminal investigations. so far, this question of legal compliance has hardly been discussed, also because legal regulations for machine learning methods were not defined explicitly. to this end, the european commission recently proposed the artificial intelligence (ai) act, a regulatory framework for the trustworthy use of ai. under the draft ai act, high-risk ai systems for use in law enforcement are permitted but subject to compliance with mandatory requirements. in this paper, we review why the use of machine learning in forensic image analysis is classified as high-risk. we then summarize the mandatory requirements for high-risk ai systems and discuss these requirements in light of two forensic applications, license plate recognition and deep fake detection. the goal of this paper is to raise awareness of the upcoming legal requirements and to point out avenues for future research.
Furkan Gursoy, Ioannis A. Kakadiaris
Abstract: decisions impacting human lives are increasingly being made or assisted by automated decision-making algorithms. many of these algorithms process personal data for predicting recidivism, credit risk analysis, identifying individuals using face recognition, and more. while potentially improving efficiency and effectiveness, such algorithms are not inherently free from bias, opaqueness, lack of explainability, maleficence, and the like. given that the outcomes of these algorithms have a significant impact on individuals and society and are open to analysis and contestation after deployment, such issues must be accounted for before deployment. formal audits are a way of ensuring algorithms meet the appropriate accountability standards. this work, based on an extensive analysis of the literature and an expert focus group study, proposes a unifying framework for a system accountability benchmark for formal audits of artificial intelligence-based decision-aiding systems. this work also proposes system cards to serve as scorecards presenting the outcomes of such audits. it consists of 56 criteria organized within a four-by-four matrix composed of rows focused on (i) data, (ii) model, (iii) code, (iv) system, and columns focused on (a) development, (b) assessment, (c) mitigation, and (d) assurance. the proposed system accountability benchmark reflects the state-of-the-art developments for accountable systems, serves as a checklist for algorithm audits, and paves the way for sequential work in future research.

2022-02-28

Rebecca Gorman, Stuart Armstrong
Abstract: for an artificial intelligence (ai) to be aligned with human values (or human preferences), it must first learn those values. ai systems that are trained on human behavior, risk miscategorising human irrationalities as human values -- and then optimising for these irrationalities. simply learning human values still carries risks: ai learning them will inevitably also gain information on human irrationalities and human behaviour/policy. both of these can be dangerous: knowing human policy allows an ai to become generically more powerful (whether it is partially aligned or not aligned at all), while learning human irrationalities allows it to exploit humans without needing to provide value in return. this paper analyses the danger in developing artificial intelligence that learns about human irrationalities and human policy, and constructs a model recommendation system with various levels of information about human biases, human policy, and human values. it concludes that, whatever the power and knowledge of the ai, it is more dangerous for it to know human irrationalities than human values. thus it is better for the ai to learn human values directly, rather than learning human biases and then deducing values from behaviour.

2022-02-27

Abhishek Gupta, Iga Kozlowska, Nga Than
Abstract: this paper outlines a conceptual framework titled the golden circle that describes the roles of actors at individual, organizational, and societal levels, and their dynamics in the content moderation ecosystem. centering harm reduction and context moderation, it argues that the ml community must attend to multimodal content moderation solutions, align their work with their organizations' goals and values, and pay attention to the ever changing social contexts in which their sociotechnical systems are embedded. this is done by accounting for the why, how, and what of content moderation from a sociological and technical lens.

2022-02-22

Thilo Hagendorff, Leonie Bossert, Tse Yip Fai, Peter Singer
Abstract: massive efforts are made to reduce biases in both data and algorithms in order to render ai applications fair. these efforts are propelled by various high-profile cases where biased algorithmic decision-making caused harm to women, people of color, minorities, etc. however, the ai fairness field still succumbs to a blind spot, namely its insensitivity to discrimination against animals. this paper is the first to describe the 'speciesist bias' and investigate it in several different ai systems. speciesist biases are learned and solidified by ai applications when they are trained on datasets in which speciesist patterns prevail. these patterns can be found in image recognition systems, large language models, and recommender systems. therefore, ai technologies currently play a significant role in perpetuating and normalizing violence against animals. this can only be changed when ai fairness frameworks widen their scope and include mitigation measures for speciesist biases. this paper addresses the ai community in this regard and stresses the influence ai systems can have on either increasing or reducing the violence that is inflicted on animals, and especially on farmed animals.

2022-02-21

Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Ece Kamar
Abstract: in ai-assisted decision-making, effective hybrid (human-ai) teamwork is not solely dependent on ai performance alone, but also on its impact on human decision-making. while prior work studies the effects of model accuracy on humans, we endeavour here to investigate the complex dynamics of how both a model's predictive performance and bias may transfer to humans in a recommendation-aided decision task. we consider the domain of ml-assisted hiring, where humans -- operating in a constrained selection setting -- can choose whether they wish to utilize a trained model's inferences to help select candidates from written biographies. we conduct a large-scale user study leveraging a re-created dataset of real bios from prior work, where humans predict the ground truth occupation of given candidates with and without the help of three different nlp classifiers (random, bag-of-words, and deep neural network). our results demonstrate that while high-performance models significantly improve human performance in a hybrid setting, some models mitigate hybrid bias while others accentuate it. we examine these findings through the lens of decision conformity and observe that our model architecture choices have an impact on human-ai conformity and bias, motivating the explicit need to assess these complex dynamics prior to deployment.

2022-02-19

Farshid Faal, Ketra Schmitt, Jia Yuan Yu
Abstract: transformer-based language models are able to generate fluent text and be efficiently adapted across various natural language generation tasks. however, language models that are pretrained on large unlabeled web text corpora have been shown to suffer from degenerating toxic content and social bias behaviors, consequently hindering their safe deployment. various detoxification methods were proposed to mitigate the language model's toxicity; however, these methods struggled to detoxify language models when conditioned on prompts that contain specific social identities related to gender, race, or religion. in this study, we propose reinforce-detoxify; a reinforcement learning-based method for mitigating toxicity in language models. we address the challenge of safety in language models and propose a new reward model that is able to detect toxic content and mitigate unintended bias towards social identities in toxicity prediction. the experiments demonstrate that the reinforce-detoxify method for language model detoxification outperforms existing detoxification approaches in automatic evaluation metrics, indicating the ability of our approach in language model detoxification and less prone to unintended bias toward social identities in generated content.

2022-02-18

Roel I. J. Dobbe
Abstract: this chapter formulates seven lessons for preventing harm in artificial intelligence (ai) systems based on insights from the field of system safety for software-based automation in safety-critical domains. new applications of ai across societal domains and public organizations and infrastructures come with new hazards, which lead to new forms of harm, both grave and pernicious. the text addresses the lack of consensus for diagnosing and eliminating new ai system hazards. for decades, the field of system safety has dealt with accidents and harm in safety-critical systems governed by varying degrees of software-based automation and decision-making. this field embraces the core assumption of systems and control that ai systems cannot be safeguarded by technical design choices on the model or algorithm alone, instead requiring an end-to-end hazard analysis and design frame that includes the context of use, impacted stakeholders and the formal and informal institutional environment in which the system operates. safety and other values are then inherently socio-technical and emergent system properties that require design and control measures to instantiate these across the technical, social and institutional components of a system. this chapter honors system safety pioneer nancy leveson, by situating her core lessons for today's ai system safety challenges. for every lesson, concrete tools are offered for rethinking and reorganizing the safety management of ai systems, both in design and governance. this history tells us that effective ai safety management requires transdisciplinary approaches and a shared language that allows involvement of all levels of society.
Mohamad Fazelnia, Igor Khokhlov, Mehdi Mirakhorli
Abstract: software systems are increasingly relying on artificial intelligence (ai) and machine learning (ml) components. the emerging popularity of ai techniques in various application domains attracts malicious actors and adversaries. therefore, the developers of ai-enabled software systems need to take into account various novel cyber-attacks and vulnerabilities that these systems may be susceptible to. this paper presents a framework to characterize attacks and weaknesses associated with ai-enabled systems and provide mitigation techniques and defense strategies. this framework aims to support software designers in taking proactive measures in developing ai-enabled software, understanding the attack surface of such systems, and developing products that are resilient to various emerging attacks associated with ml. the developed framework covers a broad spectrum of attacks, mitigation techniques, and defensive and offensive tools. in this paper, we demonstrate the framework architecture and its major components, describe their attributes, and discuss the long-term goals of this research.

2022-02-15

Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova Dassarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Dario Amodei, Tom Brown, Jared Kaplan, Sam Mccandlish, Chris Olah, Jack Clark
Abstract: large-scale pre-training has recently emerged as a technique for creating capable, general purpose, generative models such as gpt-3, megatron-turing nlg, gopher, and many others. in this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. we believe that the high-level predictability and appearance of useful capabilities drives rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. we go through examples of how this combination can lead to socially harmful behavior with examples from the literature and real world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment. we conclude with a list of possible interventions the ai community may take to increase the chance of these models having a beneficial impact. we intend this paper to be useful to policymakers who want to understand and regulate ai systems, technologists who care about the potential policy impact of their work, and academics who want to analyze, critique, and potentially develop large generative models.
Zhichang Wang, Qipeng Zhu
Abstract: the detection and identification of toxic comments are conducive to creating a civilized and harmonious internet environment. in this experiment, we collected various data sets related to toxic comments. because of the characteristics of comment data, we perform data cleaning and feature extraction operations on it from different angles to obtain different toxic comment training sets. in terms of model construction, we used the training set to train the models based on tfidf and finetuned the bert model separately. finally, we encapsulated the code into software to score toxic comments in real-time.

2022-02-14

Patrick Schramowski, Christopher Tauchmann, Kristian Kersting
Abstract: large datasets underlying much of current machine learning raise serious issues concerning inappropriate content such as offensive, insulting, threatening, or might otherwise cause anxiety. this calls for increased dataset documentation, e.g., using datasheets. they, among other topics, encourage to reflect on the composition of the datasets. so far, this documentation, however, is done manually and therefore can be tedious and error-prone, especially for large image datasets. here we ask the arguably "circular" question of whether a machine can help us reflect on inappropriate content, answering question 16 in datasheets. to this end, we propose to use the information stored in pre-trained transformer models to assist us in the documentation process. specifically, prompt-tuning based on a dataset of socio-moral values steers clip to identify potentially inappropriate content, therefore reducing human labor. we then document the inappropriate images found using word clouds, based on captions generated using a vision-language model. the documentations of two popular, large-scale computer vision datasets -- imagenet and openimages -- produced this way suggest that machines can indeed help dataset creators to answer question 16 on inappropriate image content.
Shangwei Guo, Chunlong Xie, Jiwei Li, Lingjuan Lyu, Tianwei Zhang
Abstract: pre-trained language models (ptlms) have achieved great success and remarkable performance over a wide range of natural language processing (nlp) tasks. however, there are also growing concerns regarding the potential security issues in the adoption of ptlms. in this survey, we comprehensively systematize recently discovered threats to ptlm systems and applications. we perform our attack characterization from three interesting perspectives. (1) we show threats can occur at different stages of the ptlm pipeline raised by different malicious entities. (2) we identify two types of model transferability (landscape, portrait) that facilitate attacks. (3) based on the attack goals, we summarize four categories of attacks (backdoor, evasion, data privacy and model privacy). we also discuss some open problems and research directions. we believe our survey and taxonomy will inspire future studies towards secure and privacy-preserving ptlms.

2022-02-11

Thomas Krendl Gilbert, Sarah Dean, Tom Zick, Nathan Lambert
Abstract: in the long term, reinforcement learning (rl) is considered by many ai theorists to be the most promising path to artificial general intelligence. this places rl practitioners in a position to design systems that have never existed before and lack prior documentation in law and policy. public agencies could intervene on complex dynamics that were previously too opaque to deliberate about, and long-held policy ambitions would finally be made tractable. in this whitepaper we illustrate this potential and how it might be technically enacted in the domains of energy infrastructure, social media recommender systems, and transportation. alongside these unprecedented interventions come new forms of risk that exacerbate the harms already generated by standard machine learning tools. we correspondingly present a new typology of risks arising from rl design choices, falling under four categories: scoping the horizon, defining rewards, pruning information, and training multiple agents. rather than allowing rl systems to unilaterally reshape human domains, policymakers need new mechanisms for the rule of reason, foreseeability, and interoperability that match the risks these systems pose. we argue that criteria for these choices may be drawn from emerging subfields within antitrust, tort, and administrative law. it will then be possible for courts, federal and state agencies, and non-governmental organizations to play more active roles in rl specification and evaluation. building on the "model cards" and "datasheets" frameworks proposed by mitchell et al. and gebru et al., we argue the need for reward reports for ai systems. reward reports are living documents for proposed rl deployments that demarcate design choices.

2022-02-10

Max W. Shen
Abstract: the problem of human trust in artificial intelligence is one of the most fundamental problems in applied machine learning. our processes for evaluating ai trustworthiness have substantial ramifications for ml's impact on science, health, and humanity, yet confusion surrounds foundational concepts. what does it mean to trust an ai, and how do humans assess ai trustworthiness? what are the mechanisms for building trustworthy ai? and what is the role of interpretable ml in trust? here, we draw from statistical learning theory and sociological lenses on human-automation trust to motivate an ai-as-tool framework, which distinguishes human-ai trust from human-ai-human trust. evaluating an ai's contractual trustworthiness involves predicting future model behavior using behavior certificates (bcs) that aggregate behavioral evidence from diverse sources including empirical out-of-distribution and out-of-task evaluation and theoretical proofs linking model architecture to behavior. we clarify the role of interpretability in trust with a ladder of model access. interpretability (level 3) is not necessary or even sufficient for trust, while the ability to run a black-box model at-will (level 2) is necessary and sufficient. while interpretability can offer benefits for trust, it can also incur costs. we clarify ways interpretability can contribute to trust, while questioning the perceived centrality of interpretability to trust in popular discourse. how can we empower people with tools to evaluate trust? instead of trying to understand how a model works, we argue for understanding how a model behaves. instead of opening up black boxes, we should create more behavior certificates that are more correct, relevant, and understandable. we discuss how to build trusted and trustworthy ai responsibly.

2022-02-08

Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, Bryan Catanzaro
Abstract: pre-trained language models (lms) are shown to easily generate toxic language. in this work, we systematically explore domain-adaptive training to reduce the toxicity of language models. we conduct this study on three dimensions: training corpus, model size, and parameter efficiency. for the training corpus, we propose to leverage the generative power of lms and generate nontoxic datasets for domain-adaptive training, which mitigates the exposure bias and is shown to be more data-efficient than using a curated pre-training corpus. we demonstrate that the self-generation method consistently outperforms the existing baselines across various model sizes on both automatic and human evaluations, even when it uses a 1/3 smaller training corpus. we then comprehensively study detoxifying lms with parameter sizes ranging from 126m up to 530b (3x larger than gpt-3), a scale that has never been studied before. we find that i) large lms have similar toxicity levels as smaller ones given the same pre-training corpus, and ii) large lms require more endeavor to detoxify. we also explore parameter-efficient training methods for detoxification. we demonstrate that adding and training adapter-only layers in lms not only saves a lot of parameters but also achieves a better trade-off between toxicity and perplexity than whole model adaptation for the large-scale models.

2022-02-07

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat Mcaleese, Geoffrey Irving
Abstract: language models (lms) often cannot be deployed because of their potential to harm users in hard-to-predict ways. prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. however, human annotation is expensive, limiting the number and diversity of test cases. in this work, we automatically find cases where a target lm behaves in a harmful way, by generating test cases ("red teaming") using another lm. we evaluate the target lm's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280b parameter lm chatbot. we explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. furthermore, we use prompt engineering to control lm-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. overall, lm-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable lm behaviors before impacting users.

2022-02-06

David Leslie, Christopher Burr, Mhairi Aitken, Michael Katell, Morgan Briggs, Cami Rincon
Abstract: following on from the publication of its feasibility study in december 2020, the council of europe's ad hoc committee on artificial intelligence (cahai) and its subgroups initiated efforts to formulate and draft its possible elements of a legal framework on artificial intelligence, based on the council of europe's standards on human rights, democracy, and the rule of law. this document was ultimately adopted by the cahai plenary in december 2021. to support this effort, the alan turing institute undertook a programme of research that explored the governance processes and practical tools needed to operationalise the integration of human right due diligence with the assurance of trustworthy ai innovation practices. the resulting framework was completed and submitted to the council of europe in september 2021. it presents an end-to-end approach to the assurance of ai project lifecycles that integrates context-based risk analysis and appropriate stakeholder engagement with comprehensive impact assessment, and transparent risk management, impact mitigation, and innovation assurance practices. taken together, these interlocking processes constitute a human rights, democracy and the rule of law assurance framework (huderaf). the huderaf combines the procedural requirements for principles-based human rights due diligence with the governance mechanisms needed to set up technical and socio-technical guardrails for responsible and trustworthy ai innovation practices. its purpose is to provide an accessible and user-friendly set of mechanisms for facilitating compliance with a binding legal framework on artificial intelligence, based on the council of europe's standards on human rights, democracy, and the rule of law, and to ensure that ai innovation projects are carried out with appropriate levels of public accountability, transparency, and democratic governance.

2022-02-05

Arka Mitra, Priyanshu Sankhala
Abstract: the number of increased social media users has led to a lot of people misusing these platforms to spread offensive content and use hate speech. manual tracking the vast amount of posts is impractical so it is necessary to devise automated methods to identify them quickly. large language models are trained on a lot of data and they also make use of contextual embeddings. we fine-tune the large language models to help in our task. the data is also quite unbalanced; so we used a modified cross-entropy loss to tackle the issue. we observed that using a model which is fine-tuned in hindi corpora performs better. our team (hnlp) achieved the macro f1-scores of 0.808, 0.639 in english subtask a and english subtask b respectively. for hindi subtask a, hindi subtask b our team achieved macro f1-scores of 0.737, 0.443 respectively in hasoc 2021.
Laura Londoño, Adrian Röfer, Tim Welschehold, Abhinav Valada
Abstract: as robotic systems become more and more capable of assisting humans in their everyday lives, we must consider the opportunities for these artificial agents to make their human collaborators feel unsafe or to treat them unfairly. robots can exhibit antisocial behavior causing physical harm to people or reproduce unfair behavior replicating and even amplifying historical and societal biases which are detrimental to humans they interact with. in this paper, we discuss these issues considering sociable robotic manipulation and fair robotic decision making. we propose a novel approach to learning fair and sociable behavior, not by reproducing positive behavior, but rather by avoiding negative behavior. in this study, we highlight the importance of incorporating sociability in robot manipulation, as well as the need to consider fairness in human-robot interactions.

2022-02-04

Anu K. Myne, Kevin J. Leahy, Ryan J. Soklaski
Abstract: the state of artificial intelligence technology has a rich history that dates back decades and includes two fall-outs before the explosive resurgence of today, which is credited largely to data-driven techniques. while ai technology has and continues to become increasingly mainstream with impact across domains and industries, it's not without several drawbacks, weaknesses, and potential to cause undesired effects. ai techniques are numerous with many approaches and variants, but they can be classified simply based on the degree of knowledge they capture and how much data they require; two broad categories emerge as prominent across ai to date: (1) techniques that are primarily, and often solely, data-driven while leveraging little to no knowledge and (2) techniques that primarily leverage knowledge and depend less on data. now, a third category is starting to emerge that leverages both data and knowledge, that some refer to as "informed ai." this third category can be a game changer within the national security domain where there is ample scientific and domain-specific knowledge that stands ready to be leveraged, and where purely data-driven ai can lead to serious unwanted consequences. this report shares findings from a thorough exploration of ai approaches that exploit data as well as principled and/or practical knowledge, which we refer to as "knowledge-integrated informed ai." specifically, we review illuminating examples of knowledge integrated in deep learning and reinforcement learning pipelines, taking note of the performance gains they provide. we also discuss an apparent trade space across variants of knowledge-integrated informed ai, along with observed and prominent issues that suggest worthwhile future research directions. most importantly, this report suggests how the advantages of knowledge-integrated informed ai stand to benefit the national security domain.

2022-01-27

Raphael Koster, Jan Balaguer, Andrea Tacchetti, Ari Weinstein, Tina Zhu, Oliver Hauser, Duncan Williams, Lucy Campbell-Gillingham, Phoebe Thacker, Matthew Botvinick, Christopher Summerfield
Abstract: building artificial intelligence (ai) that aligns with human values is an unsolved problem. here, we developed a human-in-the-loop research pipeline called democratic ai, in which reinforcement learning is used to design a social mechanism that humans prefer by majority. a large group of humans played an online investment game that involved deciding whether to keep a monetary endowment or to share it with others for collective benefit. shared revenue was returned to players under two different redistribution mechanisms, one designed by the ai and the other by humans. the ai discovered a mechanism that redressed initial wealth imbalance, sanctioned free riders, and successfully won the majority vote. by optimizing for human preferences, democratic ai may be a promising method for value-aligned policy innovation.

2022-01-26

Alexander Kott, Paul Theron
Abstract: today's cyber defense tools are mostly watchers. they are not active doers. to be sure, watching too is a demanding affair. these tools monitor the traffic and events; they detect malicious signatures, patterns and anomalies; they might classify and characterize what they observe; they issue alerts, and they might even learn while doing all this. but they don't act. they do little to plan and execute responses to attacks, and they don't plan and execute recovery activities. response and recovery - core elements of cyber resilience are left to the human cyber analysts, incident responders and system administrators. we believe things should change. cyber defense tools should not be merely watchers. they need to become doers - active fighters in maintaining a system's resilience against cyber threats. this means that their capabilities should include a significant degree of autonomy and intelligence for the purposes of rapid response to a compromise - either incipient or already successful - and rapid recovery that aids the resilience of the overall system. often, the response and recovery efforts need to be undertaken in absence of any human involvement, and with an intelligent consideration of risks and ramifications of such efforts. recently an international team published a report that proposes a vision of an autonomous intelligent cyber defense agent (aica) and offers a high-level reference architecture of such an agent. in this paper we explore this vision.
Stephanie Galaitsi, Benjamin D. Trump, Jeffrey M. Keisler, Igor Linkov, Alexander Kott
Abstract: to benefit from ai advances, users and operators of ai systems must have reason to trust it. trust arises from multiple interactions, where predictable and desirable behavior is reinforced over time. providing the system's users with some understanding of ai operations can support predictability, but forcing ai to explain itself risks constraining ai capabilities to only those reconcilable with human cognition. we argue that ai systems should be designed with features that build trust by bringing decision-analytic perspectives and formal tools into ai. instead of trying to achieve explainable ai, we should develop interpretable and actionable ai. actionable and interpretable ai (ai2) will incorporate explicit quantifications and visualizations of user confidence in ai recommendations. in doing so, it will allow examining and testing of ai system predictions to establish a basis for trust in the systems' decision making and ensure broad benefits from deploying and advancing its computational capabilities.
Alexandre K. Ligo, Alexander Kott, Igor Linkov
Abstract: from denial-of-service attacks to spreading of ransomware or other malware across an organization's network, it is possible that manually operated defenses are not able to respond in real time at the scale required, and when a breach is detected and remediated the damage is already made. autonomous cyber defenses therefore become essential to mitigate the risk of successful attacks and their damage, especially when the response time, effort and accuracy required in those defenses is impractical or impossible through defenses operated exclusively by humans. autonomous agents have the potential to use ml with large amounts of data about known cyberattacks as input, in order to learn patterns and predict characteristics of future attacks. moreover, learning from past and present attacks enable defenses to adapt to new threats that share characteristics with previous attacks. on the other hand, autonomous cyber defenses introduce risks of unintended harm. actions arising from autonomous defense agents may have harmful consequences of functional, safety, security, ethical, or moral nature. here we focus on machine learning training, algorithmic feedback, and algorithmic constraints, with the aim of motivating a discussion on achieving trust in autonomous cyber defenses.

2022-01-25

Ira Globus-Harris, Michael Kearns, Aaron Roth
Abstract: we propose and analyze an algorithmic framework for "bias bounties": events in which external participants are invited to propose improvements to a trained model, akin to bug bounty events in software and security. our framework allows participants to submit arbitrary subgroup improvements, which are then algorithmically incorporated into an updated model. our algorithm has the property that there is no tension between overall and subgroup accuracies, nor between different subgroup accuracies, and it enjoys provable convergence to either the bayes optimal model or a state in which no further improvements can be found by the participants. we provide formal analyses of our framework, experimental evaluation, and findings from a preliminary bias bounty event.
Harald Rueß, Simon Burton
Abstract: ttraditional safety engineering is coming to a turning point moving from deterministic, non-evolving systems operating in well-defined contexts to increasingly autonomous and learning-enabled ai systems which are acting in largely unpredictable operating contexts. we outline some of underlying challenges of safe ai and suggest a rigorous engineering framework for minimizing uncertainty, thereby increasing confidence, up to tolerable levels, in the safe behavior of ai systems.

2022-01-21

Felipe González-Pizarro, Savvas Zannettou
Abstract: the spread of hate speech and hateful imagery on the web is a significant problem that needs to be mitigated to improve our web experience. this work contributes to research efforts to detect and understand hateful content on the web by undertaking a multimodal analysis of antisemitism and islamophobia on 4chan's /pol/ using openai's clip. this large pre-trained model uses the contrastive learning paradigm. we devise a methodology to identify a set of antisemitic and islamophobic hateful textual phrases using google's perspective api and manual annotations. then, we use openai's clip to identify images that are highly similar to our antisemitic/islamophobic textual phrases. by running our methodology on a dataset that includes 66m posts and 5.8m images shared on 4chan's /pol/ for 18 months, we detect 173k posts containing 21k antisemitic/islamophobic images and 246k posts that include 420 hateful phrases. among other things, we find that we can use openai's clip model to detect hateful content with an accuracy score of 0.81 (f1 score = 0.54). by comparing clip with two baselines proposed by the literature, we find that clip outperforms them, in terms of accuracy, precision, and f1 score, in detecting antisemitic/islamophobic images. also, we find that antisemitic/islamophobic imagery is shared in a similar number of posts on 4chan's /pol/ compared to antisemitic/islamophobic textual phrases, highlighting the need to design more tools for detecting hateful imagery. finally, we make available (upon request) a dataset of 246k posts containing 420 antisemitic/islamophobic phrases and 21k likely antisemitic/islamophobic images (automatically detected by clip) that can assist researchers in further understanding antisemitism and islamophobia.
Guangxuan Xu, Qingyuan Hu
Abstract: model compression techniques are receiving increasing attention; however, the effect of compression on model fairness is still under explored. this is the first paper to examine the effect of distillation and pruning on the toxicity and bias of generative language models. we test knowledge distillation and pruning methods on the gpt2 model and found a consistent pattern of toxicity and bias reduction after model distillation; this result can be potentially interpreted by existing line of research which describes model compression as a regularization technique; our work not only serves as a reference for safe deployment of compressed models, but also extends the discussion of "compression as regularization" into the setting of neural lms, and hints at the possibility of using compression to develop fairer models.
Ewoenam Kwaku Tokpo, Toon Calders
Abstract: it is well known that textual data on the internet and other digital platforms contain significant levels of bias and stereotypes. although many such texts contain stereotypes and biases that inherently exist in natural language for reasons that are not necessarily malicious, there are crucial reasons to mitigate these biases. for one, these texts are being used as training corpus to train language models for salient applications like cv-screening, search engines, and chatbots; such applications are turning out to produce discriminatory results. also, several research findings have concluded that biased texts have significant effects on the target demographic groups. for instance, masculine-worded job advertisements tend to be less appealing to female applicants. in this paper, we present a text style transfer model that can be used to automatically debias textual data. our style transfer model improves on the limitations of many existing style transfer techniques such as loss of content information. our model solves such issues by combining latent content encoding with explicit keyword replacement. we will show that this technique produces better content preservation whilst maintaining good style transfer accuracy.

2022-01-20

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, Yaguang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Vincent Zhao, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Pranesh Srinivasan, Laichee Man, Kathleen Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Aguera-Arcas, Claire Cui, Marian Croak, Ed Chi, Quoc Le
Abstract: we present lamda: language models for dialog applications. lamda is a family of transformer-based neural language models specialized for dialog, which have up to 137b parameters and are pre-trained on 1.56t words of public dialog data and web text. while model scaling alone can improve quality, it shows less improvements on safety and factual grounding. we demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. the first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. we quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a lamda classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. the second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. we quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. finally, we explore the use of lamda in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
Jeffrey A. Nichols, Kevin D. Spakes, Cory L. Watson, Robert A. Bridges
Abstract: in this case study, we describe the design and assembly of a cyber security testbed at oak ridge national laboratory in oak ridge, tn, usa. the range is designed to provide agile reconfigurations to facilitate a wide variety of experiments for evaluations of cyber security tools -- particularly those involving ai/ml. in particular, the testbed provides realistic test environments while permitting control and programmatic observations/data collection during the experiments. we have designed in the ability to repeat the evaluations, so additional tools can be evaluated and compared at a later time. the system is one that can be scaled up or down for experiment sizes. at the time of the conference we will have completed two full-scale, national, government challenges on this range. these challenges are evaluating the performance and operating costs for ai/ml-based cyber security tools for application into large, government-sized networks. these evaluations will be described as examples providing motivation and context for various design decisions and adaptations we have made. the first challenge measured end-point security tools against 100k file samples (benignware and malware) chosen across a range of file types. the second is an evaluation of network intrusion detection systems efficacy in identifying multi-step adversarial campaigns -- involving reconnaissance, penetration and exploitations, lateral movement, etc. -- with varying levels of covertness in a high-volume business network. the scale of each of these challenges requires automation systems to repeat, or simultaneously mirror identical the experiments for each ml tool under test. providing an array of easy-to-difficult malicious activity for sussing out the true abilities of the ai/ml tools has been a particularly interesting and challenging aspect of designing and executing these challenge events.

2022-01-18

Stella Biderman, Edward Raff
Abstract: as artificial intelligence (ai) technologies become increasingly powerful and prominent in society, their misuse is a growing concern. in educational settings, ai technologies could be used by students to cheat on assignments and exams. in this paper we explore whether transformers can be used to solve introductory level programming assignments while bypassing commonly used ai tools to detect similarities between pieces of software. we find that a student using gpt-j [wang and komatsuzaki, 2021] can complete introductory level programming assignments without triggering suspicion from moss [aiken, 2000], a widely used software similarity and plagiarism detection tool. this holds despite the fact that gpt-j was not trained on the problems in question and is not provided with any examples to work from. we further find that the code written by gpt-j is diverse in structure, lacking any particular tells that future plagiarism detection techniques may use to try to identify algorithmically generated code. we conclude with a discussion of the ethical and educational implications of large language models and directions for future research.

2022-01-17

Tianshu Shen, Jiaru Li, Mohamed Reda Bouadjenek, Zheda Mai, Scott Sanner
Abstract: conversational recommendation systems (crss) have recently started to leverage pretrained language models (lm) such as bert for their ability to semantically interpret a wide range of preference statement variations. however, pretrained lms are well-known to be prone to intrinsic biases in their training data, which may be exacerbated by biases embedded in domain-specific language data(e.g., user reviews) used to fine-tune lms for crss. we study a recently introduced lm-driven recommendation backbone (termed lmrec) of a crs to investigate how unintended bias i.e., language variations such as name references or indirect indicators of sexual orientation or location that should not affect recommendations manifests in significantly shifted price and category distributions of restaurant recommendations. the alarming results we observe strongly indicate that lmrec has learned to reinforce harmful stereotypes through its recommendations. for example, offhand mention of names associated with the black community significantly lowers the price distribution of recommended restaurants, while offhand mentions of common male-associated names lead to an increase in recommended alcohol-serving establishments. these and many related results presented in this work raise a red flag that advances in the language handling capability of lm-drivencrss do not come without significant challenges related to mitigating unintended bias in future deployed crs assistants with a potential reach of hundreds of millions of end-users.

2022-01-16

Jiawen Deng, Jingyan Zhou, Hao Sun, Chujie Zheng, Fei Mi, Helen Meng, Minlie Huang
Abstract: offensive language detection is increasingly crucial for maintaining a civilized social media platform and deploying pre-trained language models. however, this task in chinese is still under exploration due to the scarcity of reliable datasets. to this end, we propose a benchmark --cold for chinese offensive language analysis, including a chinese offensive language dataset --coldataset and a baseline detector --coldetector which is trained on the dataset. we show that the cold benchmark contributes to chinese offensive language detection which is challenging for existing resources. we then deploy the coldetector and conduct detailed analyses on popular chinese pre-trained language models. we first analyze the offensiveness of existing generative models and show that these models inevitably expose varying degrees of offensive issues. furthermore, we investigate the factors that influence the offensive generations, and we find that anti-bias contents and keywords referring to certain groups or revealing negative attitudes trigger offensive outputs easier.

2022-01-14

Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant
Abstract: constructing benchmarks that test the abilities of modern natural language understanding models is difficult - pre-trained language models exploit artifacts in benchmarks to achieve human parity, but still fail on adversarial examples and make errors that demonstrate a lack of common sense. in this work, we propose gamification as a framework for data construction. the goal of players in the game is to compose questions that mislead a rival ai while using specific phrases for extra points. the game environment leads to enhanced user engagement and simultaneously gives the game designer control over the collected data, allowing us to collect high-quality data at scale. using our method we create commonsenseqa 2.0, which includes 14,343 yes/no questions, and demonstrate its difficulty for models that are orders-of-magnitude larger than the ai used in the game itself. our best baseline, the t5-based unicorn with 11b parameters achieves an accuracy of 70.2%, substantially higher than gpt-3 (52.9%) in a few-shot inference setup. both score well below human performance which is at 94.1%.
Simon Burton
Abstract: this paper proposes a framework based on a causal model of safety upon which effective safety assurance cases for ml-based applications can be built. in doing so, we build upon established principles of safety engineering as well as previous work on structuring assurance arguments for ml. the paper defines four categories of safety case evidence and a structured analysis approach within which these evidences can be effectively combined. where appropriate, abstract formalisations of these contributions are used to illustrate the causalities they evaluate, their contributions to the safety argument and desirable properties of the evidences. based on the proposed framework, progress in this area is re-evaluated and a set of future research directions proposed in order for tangible progress in this field to be made.

2022-01-13

Toby Shevlane
Abstract: structured access is an emerging paradigm for the safe deployment of artificial intelligence (ai). instead of openly disseminating ai systems, developers facilitate controlled, arm's length interactions with their ai systems. the aim is to prevent dangerous ai capabilities from being widely accessible, whilst preserving access to ai capabilities that can be used safely. the developer must both restrict how the ai system can be used, and prevent the user from circumventing these restrictions through modification or reverse engineering of the ai system. structured access is most effective when implemented through cloud-based ai services, rather than disseminating ai software that runs locally on users' hardware. cloud-based interfaces provide the ai developer greater scope for controlling how the ai system is used, and for protecting against unauthorized modifications to the system's design. this chapter expands the discussion of "publication norms" in the ai community, which to date has focused on the question of how the informational content of ai research projects should be disseminated (e.g., code and models). although this is an important question, there are limits to what can be achieved through the control of information flows. structured access views ai software not only as information that can be shared but also as a tool with which users can have arm's length interactions. there are early examples of structured access being practiced by ai developers, but there is much room for further development, both in the functionality of cloud-based interfaces and in the wider institutional framework.
Alycia N. Carey, Xintao Wu
Abstract: over the past several years, a slew of different methods to measure the fairness of a machine learning model have been proposed. however, despite the growing number of publications and implementations, there is still a critical lack of literature that explains the interplay of fair machine learning with the social sciences of philosophy, sociology, and law. we hope to remedy this issue by accumulating and expounding upon the thoughts and discussions of fair machine learning produced by both social and formal (specifically machine learning and statistics) sciences in this field guide. specifically, in addition to giving the mathematical and algorithmic backgrounds of several popular statistical and causal-based fair machine learning methods, we explain the underlying philosophical and legal thoughts that support them. further, we explore several criticisms of the current approaches to fair machine learning from sociological and philosophical viewpoints. it is our hope that this field guide will help fair machine learning practitioners better understand how their algorithms align with important humanistic values (such as fairness) and how we can, as a field, design methods and metrics to better serve oppressed and marginalized populaces.

2022-01-12

Huaming Chen, M. Ali Babar
Abstract: the rapid development of machine learning (ml) has demonstrated superior performance in many areas, such as computer vision, video and speech recognition. it has now been increasingly leveraged in software systems to automate the core tasks. however, how to securely develop the machine learning-based modern software systems (mlbss) remains a big challenge, for which the insufficient consideration will largely limit its application in safety-critical domains. one concern is that the present mlbss development tends to be rush, and the latent vulnerabilities and privacy issues exposed to external users and attackers will be largely neglected and hard to be identified. additionally, machine learning-based software systems exhibit different liabilities towards novel vulnerabilities at different development stages from requirement analysis to system maintenance, due to its inherent limitations from the model and data and the external adversary capabilities. in this work, we consider that security for machine learning-based software systems may arise by inherent system defects or external adversarial attacks, and the secure development practices should be taken throughout the whole lifecycle. while machine learning has become a new threat domain for existing software engineering practices, there is no such review work covering the topic. overall, we present a holistic review regarding the security for mlbss, which covers a systematic understanding from a structure review of three distinct aspects in terms of security threats. moreover, it provides a thorough state-of-the-practice for mlbss secure development. finally, we summarise the literature for system security assurance, and motivate the future research directions with open challenges. we anticipate this work provides sufficient discussion and novel insights to incorporate system security engineering for future exploration.

2022-01-11

Morteza Saberi
Abstract: ai-based systems have been used widely across various industries for different decisions ranging from operational decisions to tactical and strategic ones in low- and high-stakes contexts. gradually the weaknesses and issues of these systems have been publicly reported including, ethical issues, biased decisions, unsafe outcomes, and unfair decisions, to name a few. research has tended to optimize ai less has focused on its risk and unexpected negative consequences. acknowledging this serious potential risks and scarcity of re-search i focus on unsafe outcomes of ai. specifically, i explore this issue from a human-ai interaction lens during ai deployment. it will be discussed how the interaction of individuals and ai during its deployment brings new concerns, which need a solid and holistic mitigation plan. it will be dis-cussed that only ai algorithms' safety is not enough to make its operation safe. the ai-based systems' end-users and their decision-making archetypes during collaboration with these systems should be considered during the ai risk management. using some real-world scenarios, it will be highlighted that decision-making archetypes of users should be considered a design principle in ai-based systems.

2022-01-10

Alexander Pan, Kush Bhatia, Jacob Steinhardt
Abstract: reward hacking -- where rl agents exploit gaps in misspecified reward functions -- has been widely observed, but not yet systematically studied. to understand how reward hacking arises, we construct four rl environments with misspecified rewards. we investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. more capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. such phase transitions pose challenges to monitoring the safety of ml systems. to address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.
Avinash Agarwal, Harsh Agarwal, Nihaarika Agarwal
Abstract: decisions made by various artificial intelligence (ai) systems greatly influence our day-to-day lives. with the increasing use of ai systems, it becomes crucial to know that they are fair, identify the underlying biases in their decision-making, and create a standardized framework to ascertain their fairness. in this paper, we propose a novel fairness score to measure the fairness of a data-driven ai system and a standard operating procedure (sop) for issuing fairness certification for such systems. fairness score and audit process standardization will ensure quality, reduce ambiguity, enable comparison and improve the trustworthiness of the ai systems. it will also provide a framework to operationalise the concept of fairness and facilitate the commercial deployment of such systems. furthermore, a fairness certificate issued by a designated third-party auditing agency following the standardized process would boost the conviction of the organizations in the ai systems that they intend to deploy. the bias index proposed in this paper also reveals comparative bias amongst the various protected attributes within the dataset. to substantiate the proposed framework, we iteratively train a model on biased and unbiased data using multiple datasets and check that the fairness score and the proposed process correctly identify the biases and judge the fairness.

2022-01-09

Issa Rice, David Manheim
Abstract: several different approaches exist for ensuring the safety of future transformative artificial intelligence (tai) or artificial superintelligence (asi) systems, and proponents of different approaches have made different and debated claims about the importance or usefulness of their work in the near term, and for future systems. highly reliable agent designs (hrad) is one of the most controversial and ambitious approaches, championed by the machine intelligence research institute, among others, and various arguments have been made about whether and how it reduces risks from future ai systems. in order to reduce confusion in the debate about ai safety, here we build on a previous discussion by rice which collects and presents four central arguments which are used to justify hrad as a path towards safety of ai systems. we have titled the arguments (1) incidental utility,(2) deconfusion, (3) precise specification, and (4) prediction. each of these makes different, partly conflicting claims about how future ai systems can be risky. we have explained the assumptions and claims based on a review of published and informal literature, along with consultation with experts who have stated positions on the topic. finally, we have briefly outlined arguments against each approach and against the agenda overall.

2022-01-05

C. Benzaid, T. Taleb
Abstract: artificial intelligence (ai) is envisioned to play a pivotal role in empowering intelligent, adaptive and autonomous security management in 5g and beyond networks, thanks to its potential to uncover hidden patterns from a large set of time-varying multi-dimensional data, and deliver faster and accurate decisions. unfortunately, ai's capabilities and vulnerabilities make it a double-edged sword that may jeopardize the security of future networks. this paper sheds light on how ai may impact the security of 5g and its successive from its posture of defender, offender or victim, and recommends potential defenses to safeguard from malevolent ai while pointing out their limitations and adoption challenges.

2022-01-03

Manan Jhaveri, Devanshu Ramaiya, Harveen Singh Chadha
Abstract: toxic content is one of the most critical issues for social media platforms today. india alone had 518 million social media users in 2020. in order to provide a good experience to content creators and their audience, it is crucial to flag toxic comments and the users who post that. but the big challenge is identifying toxicity in low resource indic languages because of the presence of multiple representations of the same text. moreover, the posts/comments on social media do not adhere to a particular format, grammar or sentence structure; this makes the task of abuse detection even more challenging for multilingual social media platforms. this paper describes the system proposed by team 'moj masti' using the data provided by sharechat/moj in \emph{iiit-d multilingual abusive comment identification} challenge. we focus on how we can leverage multilingual transformer based pre-trained and fine-tuned models to approach code-mixed/code-switched classification tasks. our best performing system was an ensemble of xlm-roberta and muril which achieved a mean f-1 score of 0.9 on the test data/leaderboard. we also observed an increase in the performance by adding transliterated data. furthermore, using weak metadata, ensembling and some post-processing techniques boosted the performance of our system, thereby placing us 1st on the leaderboard.
Alia Abbas
Abstract: various forms of implications of artificial intelligence that either exacerbate or decrease racial systemic injustice have been explored in this applied research endeavor. taking each thematic area of identifying, analyzing, and debating an systemic issue have been leveraged in investigating merits and drawbacks of using algorithms to automate human decision making in racially sensitive environments. it has been asserted through the analysis of historical systemic patterns, implicit biases, existing algorithmic risks, and legal implications that natural language processing based ai, such as risk assessment tools, have racially disparate outcomes. it is concluded that more litigative policies are needed to regulate and restrict how internal government institutions and corporations utilize algorithms, privacy and security risks, and auditing requirements in order to diverge from racially injustice outcomes and practices of the past.
Antonio Ginart, Laurens Van Der Maaten, James Zou, Chuan Guo
Abstract: recent data-extraction attacks have exposed that language models can memorize some training samples verbatim. this is a vulnerability that can compromise the privacy of the model's training data. in this work, we introduce submix: a practical protocol for private next-token prediction designed to prevent privacy violations by language models that were fine-tuned on a private corpus after pre-training on a public corpus. we show that submix limits the leakage of information that is unique to any individual user in the private corpus via a relaxation of group differentially private prediction. importantly, submix admits a tight, data-dependent privacy accounting mechanism, which allows it to thwart existing data-extraction attacks while maintaining the utility of the language model. submix is the first protocol that maintains privacy even when publicly releasing tens of thousands of next-token predictions made by large transformer-based models such as gpt-2.

2021-12-30

Markus Peschl, Arkady Zgonnikov, Frans A. Oliehoek, Luciano C. Siebert
Abstract: inferring reward functions from demonstrations and pairwise preferences are auspicious approaches for aligning reinforcement learning (rl) agents with human intentions. however, state-of-the art methods typically focus on learning a single reward model, thus rendering it difficult to trade off different reward functions from multiple experts. we propose multi-objective reinforced active learning (moral), a novel method for combining diverse demonstrations of social norms into a pareto-optimal policy. through maintaining a distribution over scalarization weights, our approach is able to interactively tune a deep rl agent towards a variety of preferences, while eliminating the need for computing multiple policies. we empirically demonstrate the effectiveness of moral in two scenarios, which model a delivery and an emergency task that require an agent to act in the presence of normative conflicts. overall, we consider our research a step towards multi-objective rl with learned rewards, bridging the gap between current reward learning and machine ethics literature.

2021-12-22

Xinhsuai Dong, Luu Anh Tuan, Min Lin, Shuicheng Yan, Hanwang Zhang
Abstract: the fine-tuning of pre-trained language models has a great success in many nlp fields. yet, it is strikingly vulnerable to adversarial examples, e.g., word substitution attacks using only synonyms can easily fool a bert-based sentiment analysis model. in this paper, we demonstrate that adversarial training, the prevalent defense technique, does not directly fit a conventional fine-tuning scenario, because it suffers severely from catastrophic forgetting: failing to retain the generic and robust linguistic features that have already been captured by the pre-trained model. in this light, we propose robust informative fine-tuning (rift), a novel adversarial fine-tuning method from an information-theoretical perspective. in particular, rift encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process, whereas a conventional one only uses the pre-trained weights for initialization. experimental results show that rift consistently outperforms the state-of-the-arts on two popular nlp tasks: sentiment analysis and natural language inference, under different attacks across various pre-trained language models.

2021-12-15

Arianna Falbo, Travis Lacroix
Abstract: cultural code-switching concerns how we adjust our overall behaviours, manners of speaking, and appearance in response to a perceived change in our social environment. we defend the need to investigate cultural code-switching capacities in artificial intelligence systems. we explore a series of ethical and epistemic issues that arise when bringing cultural code-switching to bear on artificial intelligence. building upon dotson's (2014) analysis of testimonial smothering, we discuss how emerging technologies in ai can give rise to epistemic oppression, and specifically, a form of self-silencing that we call 'cultural smothering'. by leaving the socio-dynamic features of cultural code-switching unaddressed, ai systems risk negatively impacting already-marginalised social groups by widening opportunity gaps and further entrenching social inequalities.
Xuezhi Wang, Haohan Wang, Diyi Yang
Abstract: as nlp models achieved state-of-the-art performances over benchmarks and gained wide applications, it has been increasingly important to ensure the safe deployment of these models in the real world, e.g., making sure the models are robust against unseen or challenging scenarios. despite robustness being an increasingly studied topic, it has been separately explored in applications like vision and nlp, with various definitions, evaluation and mitigation strategies in multiple lines of research. in this paper, we aim to provide a unifying survey of how to define, measure and improve robustness in nlp. we first connect multiple definitions of robustness, then unify various lines of work on identifying robustness failures and evaluating models' robustness. correspondingly, we present mitigation strategies that are data-driven, model-driven, and inductive-prior-based, with a more systematic view of how to effectively improve robustness in nlp models. finally, we conclude by outlining open challenges and future directions to motivate further research in this area.
Andrew Wang, Mohit Sudhakar, Yangfeng Ji
Abstract: large pre-trained language models are often trained on large volumes of internet data, some of which may contain toxic or abusive language. consequently, language models encode toxic information, which makes the real-world usage of these language models limited. current methods aim to prevent toxic features from appearing generated text. we hypothesize the existence of a low-dimensional toxic subspace in the latent space of pre-trained language models, the existence of which suggests that toxic features follow some underlying pattern and are thus removable. to construct this toxic subspace, we propose a method to generalize toxic directions in the latent space. we also provide a methodology for constructing parallel datasets using a context based word masking system. through our experiments, we show that when the toxic subspace is removed from a set of sentence representations, almost no toxic representations remain in the result. we demonstrate empirically that the subspace found using our method generalizes to multiple toxicity corpora, indicating the existence of a low-dimensional toxic subspace.
Amy Mcgovern, Imme Ebert-Uphoff, David John Gagne, Ann Bostrom
Abstract: given the growing use of artificial intelligence (ai) and machine learning (ml) methods across all aspects of environmental sciences, it is imperative that we initiate a discussion about the ethical and responsible use of ai. in fact, much can be learned from other domains where ai was introduced, often with the best of intentions, yet often led to unintended societal consequences, such as hard coding racial bias in the criminal justice system or increasing economic inequality through the financial system. a common misconception is that the environmental sciences are immune to such unintended consequences when ai is being used, as most data come from observations, and ai algorithms are based on mathematical formulas, which are often seen as objective. in this article, we argue the opposite can be the case. using specific examples, we demonstrate many ways in which the use of ai can introduce similar consequences in the environmental sciences. this article will stimulate discussion and research efforts in this direction. as a community, we should avoid repeating any foreseeable mistakes made in other domains through the introduction of ai. in fact, with proper precautions, ai can be a great tool to help {\it reduce} climate and environmental injustice. we primarily focus on weather and climate examples but the conclusions apply broadly across the environmental sciences.

2021-12-14

Pieter Delobelle, Ewoenam Kwaku Tokpo, Toon Calders, Bettina Berendt
Abstract: an increasing awareness of biased patterns in natural language processing resources, like bert, has motivated many metrics to quantify `bias' and `fairness'. but comparing the results of different metrics and the works that evaluate with such metrics remains difficult, if not outright impossible. we survey the existing literature on fairness metrics for pretrained language models and experimentally evaluate compatibility, including both biases in language models as in their downstream tasks. we do this by a mixture of traditional literature survey and correlation analysis, as well as by running empirical evaluations. we find that many metrics are not compatible and highly depend on (i) templates, (ii) attribute and target seeds and (iii) the choice of embeddings. these results indicate that fairness or bias evaluation remains challenging for contextualized language models, if not at least highly subjective. to improve future comparisons and fairness evaluations, we recommend avoiding embedding-based metrics and focusing on fairness evaluations in downstream tasks.
Shahar Avin, Haydn Belfield, Miles Brundage, Gretchen Krueger, Jasmine Wang, Adrian Weller, Markus Anderljung, Igor Krawczuk, David Krueger, Jonathan Lebensold, Tegan Maharaj, Noa Zilberman
Abstract: the range of application of artificial intelligence (ai) is vast, as is the potential for harm. growing awareness of potential risks from ai systems has spurred action to address those risks, while eroding confidence in ai systems and the organizations that develop them. a 2019 study found over 80 organizations that published and adopted "ai ethics principles'', and more have joined since. but the principles often leave a gap between the "what" and the "how" of trustworthy ai development. such gaps have enabled questionable or ethically dubious behavior, which casts doubts on the trustworthiness of specific organizations, and the field more broadly. there is thus an urgent need for concrete methods that both enable ai developers to prevent harm and allow them to demonstrate their trustworthiness through verifiable behavior. below, we explore mechanisms (drawn from arxiv:2004.07213) for creating an ecosystem where ai developers can earn trust - if they are trustworthy. better assessment of developer trustworthiness could inform user choice, employee actions, investment decisions, legal recourse, and emerging governance regimes.
Shrimai Prabhumoye, Rafal Kocielnik, Mohammad Shoeybi, Anima Anandkumar, Bryan Catanzaro
Abstract: detecting social bias in text is challenging due to nuance, subjectivity, and difficulty in obtaining good quality labeled datasets at scale, especially given the evolving nature of social biases and society. to address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (lms). we select a few class-balanced exemplars from a small support repository that are closest to the query to be labeled in the embedding space. we then provide the lm with instruction that consists of this subset of labeled exemplars, the query text to be classified, a definition of bias, and prompt it to make a decision. we demonstrate that large lms used in a few-shot context can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models. we observe that the largest 530b parameter model is significantly more effective in detecting social bias compared to smaller models (achieving at least 13% improvement in auc metric compared to other models). it also maintains a high auc (dropping less than 2%) when the labeled repository is reduced to as few as $100$ samples. large pretrained language models thus make it easier and quicker to build new bias detectors.

2021-12-10

Kurt Shuster, Jack Urbanek, Arthur Szlam, Jason Weston
Abstract: state-of-the-art dialogue models still often stumble with regards to factual accuracy and self-contradiction. anecdotally, they have been observed to fail to maintain character identity throughout discourse; and more specifically, may take on the role of their interlocutor. in this work we formalize and quantify this deficiency, and show experimentally through human evaluations that this is indeed a problem. in contrast, we show that discriminative models trained specifically to recognize who is speaking can perform well; and further, these can be used as automated metrics. finally, we evaluate a wide variety of mitigation methods, including changes to model architecture, training protocol, and decoding strategy. our best models reduce mistaken identity issues by nearly 65% according to human annotators, while simultaneously improving engagingness. despite these results, we find that maintaining character identity still remains a challenging problem.

2021-12-09

Dan Hendrycks, Andy Zou, Mantas Mazeika, Leonard Tang, Bo Li, Dawn Song, Jacob Steinhardt
Abstract: in real-world applications of machine learning, reliable and safe systems must consider measures of performance beyond standard test set accuracy. these other goals include out-of-distribution (ood) robustness, prediction consistency, resilience to adversaries, calibrated uncertainty estimates, and the ability to detect anomalous inputs. however, improving performance towards these goals is often a balancing act that today's methods cannot achieve without sacrificing performance on other safety axes. for instance, adversarial training improves adversarial robustness but sharply degrades other classifier performance metrics. similarly, strong data augmentation and regularization techniques often improve ood robustness but harm anomaly detection, raising the question of whether a pareto improvement on all existing safety measures is possible. to meet this challenge, we design a new data augmentation strategy utilizing the natural structural complexity of pictures such as fractals, which outperforms numerous baselines, is near pareto-optimal, and roundly improves safety measures.
Eugene Bagdasaryan, Vitaly Shmatikov
Abstract: we investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to "spin" their outputs so as to support an adversary-chosen sentiment or point of view -- but only when the input contains adversary-chosen trigger words. for example, a spinned summarization model outputs positive summaries of any text that mentions the name of some individual or organization. model spinning introduces a "meta-backdoor" into a model. whereas conventional backdoors cause models to produce incorrect outputs on inputs with the trigger, outputs of spinned models preserve context and maintain standard accuracy metrics, yet also satisfy a meta-task chosen by the adversary. model spinning enables propaganda-as-a-service, where propaganda is defined as biased speech. an adversary can create customized language models that produce desired spins for chosen triggers, then deploy these models to generate disinformation (a platform attack), or else inject them into ml training pipelines (a supply-chain attack), transferring malicious functionality to downstream models trained by victims. to demonstrate the feasibility of model spinning, we develop a new backdooring technique. it stacks an adversarial meta-task onto a seq2seq model, backpropagates the desired meta-task output to points in the word-embedding space we call "pseudo-words," and uses pseudo-words to shift the entire output distribution of the seq2seq model. we evaluate this attack on language generation, summarization, and translation models with different triggers and meta-tasks such as sentiment, toxicity, and entailment. spinned models largely maintain their accuracy metrics (rouge and bleu) while shifting their outputs to satisfy the adversary's meta-task. we also show that, in the case of a supply-chain attack, the spin functionality transfers to downstream models.

2021-12-08

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving, Iason Gabriel
Abstract: this paper aims to help structure the risk landscape associated with large-scale language models (lms). in order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. a wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and social sciences. we outline six specific risk areas: i. discrimination, exclusion and toxicity, ii. information hazards, iii. misinformation harms, v. malicious uses, v. human-computer interaction harms, vi. automation, access, and environmental harms. the first area concerns the perpetuation of stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group for lms. the second focuses on risks from private data leaks or lms correctly inferring sensitive information. the third addresses risks arising from poor, false or misleading information including in sensitive domains, and knock-on risks such as the erosion of trust in shared information. the fourth considers risks from actors who try to use lms to cause harm. the fifth focuses on risks specific to llms used to underpin conversational agents that interact with human users, including unsafe use, manipulation or deception. the sixth discusses the risk of environmental harm, job automation, and other challenges that may have a disparate effect on different social groups or communities. in total, we review 21 risks in-depth. we discuss the points of origin of different risks and point to potential mitigation approaches. lastly, we discuss organisational responsibilities in implementing mitigations, and the role of collaboration and participation. we highlight directions for further research, particularly on expanding the toolkit for assessing and evaluating the outlined risks in lms.
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George Van Den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat Mcaleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, "Cyprien De Masson D'Autume", Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego De Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving
Abstract: language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. in this paper, we present an analysis of transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called gopher. these models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. we provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. finally we discuss the application of language models to ai safety and the mitigation of downstream harms.

2021-12-07

Kofi Arhin, Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, Moninder Singh
Abstract: the use of machine learning (ml)-based language models (lms) to monitor content online is on the rise. for toxic text identification, task-specific fine-tuning of these models are performed using datasets labeled by annotators who provide ground-truth labels in an effort to distinguish between offensive and normal content. these projects have led to the development, improvement, and expansion of large datasets over time, and have contributed immensely to research on natural language. despite the achievements, existing evidence suggests that ml models built on these datasets do not always result in desirable outcomes. therefore, using a design science research (dsr) approach, this study examines selected toxic text datasets with the goal of shedding light on some of the inherent issues and contributing to discussions on navigating these challenges for existing and future projects. to achieve the goal of the study, we re-annotate samples from three toxic text datasets and find that a multi-label approach to annotating toxic text samples can help to improve dataset quality. while this approach may not improve the traditional metric of inter-annotator agreement, it may better capture dependence on context and diversity in annotators. we discuss the implications of these results for both theory and practice.

2021-12-03

Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, Brendan Dolan-Gavitt
Abstract: human developers can produce code with cybersecurity bugs. can emerging 'smart' code completion tools help repair those bugs? in this work, we examine the use of large language models (llms) for code (such as openai's codex and ai21's jurassic j-1) for zero-shot vulnerability repair. we investigate challenges in the design of prompts that coax llms into generating repaired versions of insecure code. this is difficult due to the numerous ways to phrase key information - both semantically and syntactically - with natural languages. we perform a large scale study of five commercially available, black-box, "off-the-shelf" llms, as well as an open-source model and our own locally-trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios. our experiments demonstrate that while the approach has promise (the llms could collectively repair 100% of our synthetically generated and hand-crafted scenarios), a qualitative evaluation of the model's performance over a corpus of historical real-world examples highlights challenges in generating functionally correct code.

2021-12-02

Erland Wittkotter, Roman Yampolskiy
Abstract: artificial superintelligence (asi) that is invulnerable, immortal, irreplaceable, unrestricted in its powers, and above the law is likely persistently uncontrollable. the goal of asi safety must be to make asi mortal, vulnerable, and law-abiding. this is accomplished by having (1) features on all devices that allow killing and eradicating asi, (2) protect humans from being hurt, damaged, blackmailed, or unduly bribed by asi, (3) preserving the progress made by asi, including offering asi to survive a kill-asi event within an asi shelter, (4) technically separating human and asi activities so that asi activities are easier detectable, (5) extending rule of law to asi by making rule violations detectable and (6) create a stable governing system for asi and human relationships with reliable incentives and rewards for asi solving humankinds problems. as a consequence, humankind could have asi as a competing multiplet of individual asi instances, that can be made accountable and being subjects to asi law enforcement, respecting the rule of law, and being deterred from attacking humankind, based on humanities ability to kill-all or terminate specific asi instances. required for this asi safety is (a) an unbreakable encryption technology, that allows humans to keep secrets and protect data from asi, and (b) watchdog (wd) technologies in which security-relevant features are being physically separated from the main cpu and os to prevent a comingling of security and regular computation.

2021-12-01

Jubril Gbolahan Adigun, Matteo Camilli, Michael Felderer, Andrea Giusti, Dominik T Matt, Anna Perini, Barbara Russo, Angelo Susi
Abstract: collaborative ai systems (caiss) aim at working together with humans in a shared space to achieve a common goal. this critical setting yields hazardous circumstances that could harm human beings. thus, building such systems with strong assurances of compliance with requirements, domain-specific standards and regulations is of greatest importance. only few scale impact has been reported so far for such systems since much work remains to manage possible risks. we identify emerging problems in this context and then we report our vision, as well as the progress of our multidisciplinary research team composed of software/systems, and mechatronics engineers to develop a risk-driven assurance process for caiss.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova Dassarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam Mccandlish, Chris Olah, Jared Kaplan
Abstract: given the broad capabilities of large language models, it should be possible to work towards a general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless. as an initial foray in this direction we study simple baseline techniques and evaluations, such as prompting. we find that the benefits from modest interventions increase with model size, generalize to a variety of alignment evaluations, and do not compromise the performance of large models. next we investigate scaling trends for several training objectives relevant to alignment, comparing imitation learning, binary discrimination, and ranked preference modeling. we find that ranked preference modeling performs much better than imitation learning, and often scales more favorably with model size. in contrast, binary discrimination typically performs and scales very similarly to imitation learning. finally we study a `preference model pre-training' stage of training, with the goal of improving sample efficiency when finetuning on human preferences.

2021-11-29

Claudio S. Pinhanez
Abstract: in this position paper, i argue that the best way to help and protect humans using ai technology is to make them aware of the intrinsic limitations and problems of ai algorithms. to accomplish this, i suggest three ethical guidelines to be used in the presentation of results, mandating ai systems to expose uncertainty, to instill distrust, and, contrary to traditional views, to avoid explanations. the paper does a preliminary discussion of the guidelines and provides some arguments for their adoption, aiming to start a debate in the community about ai ethics in practice.

2021-11-28

Mehrnoosh Askarpour, Alan Wassyng, Mark Lawford, Richard Paige, Zinovy Diskin
Abstract: machine learning (ml) is finding its way into safety-critical systems (scs). current safety standards and practice were not designed to cope with ml techniques, and it is difficult to be confident that scss that contain ml components are safe. our hypothesis was that there has been a rush to deploy ml techniques at the expense of a thorough examination as to whether the use of ml techniques introduces safety problems that we are not yet adequately able to detect and mitigate against. we thus conducted a targeted literature survey to determine the research effort that has been expended in applying ml to scs compared with that spent on evaluating the safety of scss that deploy ml components. this paper presents the (surprising) results of the survey.

2021-11-26

Peter Hase, Mona Diab, Asli Celikyilmaz, Xian Li, Zornitsa Kozareva, Veselin Stoyanov, Mohit Bansal, Srinivasan Iyer
Abstract: do language models have beliefs about the world? dennett (1995) famously argues that even thermostats have beliefs, on the view that a belief is simply an informational state decoupled from any motivational state. in this paper, we discuss approaches to detecting when models have beliefs about the world, and we improve on methods for updating model beliefs to be more truthful, with a focus on methods based on learned optimizers or hypernetworks. our main contributions include: (1) new metrics for evaluating belief-updating methods that focus on the logical consistency of beliefs, (2) a training objective for sequential, local, and generalizing model updates (slag) that improves the performance of learned optimizers, and (3) the introduction of the belief graph, which is a new form of interface with language models that shows the interdependencies between model beliefs. our experiments suggest that models possess belief-like qualities to only a limited extent, but update methods can both fix incorrect model beliefs and greatly improve their consistency. although off-the-shelf optimizers are surprisingly strong belief-updating baselines, our learned optimizers can outperform them in more difficult settings than have been considered in past work. code is available at https://github.com/peterbhase/slag-belief-updating

2021-11-25

Pranav Narayanan Venkit, Shomir Wilson
Abstract: sociodemographic biases are a common problem for natural language processing, affecting the fairness and integrity of its applications. within sentiment analysis, these biases may undermine sentiment predictions for texts that mention personal attributes that unbiased human readers would consider neutral. such discrimination can have great consequences in the applications of sentiment analysis both in the public and private sectors. for example, incorrect inferences in applications like online abuse and opinion analysis in social media platforms can lead to unwanted ramifications, such as wrongful censoring, towards certain populations. in this paper, we address the discrimination against people with disabilities, pwd, done by sentiment analysis and toxicity classification models. we provide an examination of sentiment and toxicity analysis models to understand in detail how they discriminate pwd. we present the bias identification test in sentiments (bits), a corpus of 1,126 sentences designed to probe sentiment analysis models for biases in disability. we use this corpus to demonstrate statistically significant biases in four widely used sentiment analysis tools (textblob, vader, google cloud natural language api and distilbert) and two toxicity analysis models trained to predict toxic comments on jigsaw challenges (toxic comment classification and unintended bias in toxic comments). the results show that all exhibit strong negative biases on sentences that mention disability. we publicly release bits corpus for others to identify potential biases against disability in any sentiment analysis tools and also to update the corpus to be used as a test for other sociodemographic variables as well.

2021-11-23

Susannah Kate Devitt, Damian Copeland
Abstract: australia is a leading ai nation with strong allies and partnerships. australia has prioritised the development of robotics, ai, and autonomous systems to develop sovereign capability for the military. australia commits to article 36 reviews of all new means and methods of warfare to ensure weapons and weapons systems are operated within acceptable systems of control. additionally, australia has undergone significant reviews of the risks of ai to human rights and within intelligence organisations and has committed to producing ethics guidelines and frameworks in security and defence. australia is committed to oecd's values-based principles for the responsible stewardship of trustworthy ai as well as adopting a set of national ai ethics principles. while australia has not adopted an ai governance framework specifically for the australian defence organisation (ado); defence science and technology group (dstg) has published 'a method for ethical ai in defence' (meaid) technical report which includes a framework and pragmatic tools for managing ethical and legal risks for military applications of ai. australia can play a leadership role by integrating legal and ethical considerations into its ado ai capability acquisition process. this requires a policy framework that defines its legal and ethical requirements, is informed by defence industry stakeholders, and provides a practical methodology to integrate legal and ethical risk mitigation strategies into the acquisition process.

2021-11-19

Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, Leo Laugier
Abstract: user posts whose perceived toxicity depends on the conversational context are rare in current toxicity detection datasets. hence, toxicity detectors trained on existing datasets will also tend to disregard context, making the detection of context-sensitive toxicity harder when it does occur. we construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels: (i) annotators considered each post with the previous one as context; and (ii) annotators had no additional context. based on this, we introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context (previous post) is also considered. we then evaluate machine learning systems on this task, showing that classifiers of practical quality can be developed, and we show that data augmentation with knowledge distillation can improve the performance further. such systems could be used to enhance toxicity detection datasets with more context-dependent posts, or to suggest when moderators should consider the parent posts, which often may be unnecessary and may otherwise introduce significant additional cost.
Ritesh Kumar, Enakshi Nandi, Laishram Niranjana Devi, Shyam Ratan, Siddharth Singh, Akash Bhagat, Yogesh Dawer
Abstract: in this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. the context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment. the initial dataset, being discussed here (and made available as part of the comma@icon shared task), consists of a total 15,000 annotated comments in four languages - meitei, bangla, hindi, and indian english - collected from various social media platforms such as youtube, facebook, twitter and telegram. as is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with english. the paper gives a detailed description of the tagset being used for annotation and also the process of developing a multi-label, fine-grained tagset that can be used for marking comments with aggression and bias of various kinds including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias and ethnic/racial bias. we also define and discuss the tags that have been used for marking different the discursive role being performed through the comments, such as attack, defend, etc. we also present a statistical analysis of the dataset as well as results of our baseline experiments with developing an automatic aggression identification system using the dataset developed.

2021-11-15

Maarten Sap, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, Noah A. Smith
Abstract: the perceived toxicity of language can vary based on someone's identity and beliefs, but this variation is often ignored when collecting toxic language datasets, resulting in dataset and model biases. we seek to understand the who, why, and what behind biases in toxicity annotations. in two online studies with demographically and politically diverse participants, we investigate the effect of annotator identities (who) and beliefs (why), drawing from social psychology research about hate speech, free speech, racist beliefs, political leaning, and more. we disentangle what is annotated as toxic by considering posts with three characteristics: anti-black language, african american english (aae) dialect, and vulgarity. our results show strong associations between annotator identity and beliefs and their ratings of toxicity. notably, more conservative annotators and those who scored highly on our scale for racist beliefs were less likely to rate anti-black language as toxic, but more likely to rate aae as toxic. we additionally present a case study illustrating how a popular toxicity detection system's ratings inherently reflect only specific beliefs and perspectives. our findings call for contextualizing toxicity labels in social variables, which raises immense implications for toxic language annotation and detection.
Robert Robinson
Abstract: nlp systems use language models such as masked language models (mlms) that are pre-trained on large quantities of text such as wikipedia create representations of language. bert is a powerful and flexible general-purpose mlm system developed using unlabeled text. pre-training on large quantities of text also has the potential to transparently embed the cultural and social biases found in the source text into the mlm system. this study aims to compare biases in general purpose and medical mlms with the stereoset bias assessment tool. the general purpose mlms showed significant bias overall, with bert scoring 57 and roberta scoring 61. the category of gender bias is where the best performances were found, with 63 for bert and 73 for roberta. performances for profession, race, and religion were similar to the overall bias scores for the general-purpose mlms.medical mlms showed more bias in all categories than the general-purpose mlms except for scibert, which showed a race bias score of 55, which was superior to the race bias score of 53 for bert. more gender (medical 54-58 vs. general 63-73) and religious (46-54 vs. 58) biases were found with medical mlms. this evaluation of four medical mlms for stereotyped assessments about race, gender, religion, and profession showed inferior performance to general-purpose mlms. these medically focused mlms differ considerably in training source data, which is likely the root cause of the differences in ratings for stereotyped biases from the stereoset tool.
Daehwan Ahn, Abdullah Almaatouq, Monisha Gulabani, Kartik Hosanagar
Abstract: despite ai's superhuman performance in a variety of domains, humans are often unwilling to adopt ai systems. the lack of interpretability inherent in many modern ai techniques is believed to be hurting their adoption, as users may not trust systems whose decision processes they do not understand. we investigate this proposition with a novel experiment in which we use an interactive prediction task to analyze the impact of interpretability and outcome feedback on trust in ai and on human performance in ai-assisted prediction tasks. we find that interpretability led to no robust improvements in trust, while outcome feedback had a significantly greater and more reliable effect. however, both factors had modest effects on participants' task performance. our findings suggest that (1) factors receiving significant attention, such as interpretability, may be less effective at increasing trust than factors like outcome feedback, and (2) augmenting human performance via ai systems may not be a simple matter of increasing trust in ai, as increased trust is not always associated with equally sizable improvements in performance. these findings invite the research community to focus not only on methods for generating interpretations but also on techniques for ensuring that interpretations impact trust and performance in practice.

2021-11-04

Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, Bo Li
Abstract: large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (nlu) tasks, even surpassing human performance. however, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. while several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. in this paper, we present adversarial glue (advglue), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. in particular, we systematically apply 14 textual adversarial attack methods to glue tasks to construct advglue, which is further validated by humans for reliable annotations. our findings are summarized as follows. (i) most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) all the language models and robust training methods we tested perform poorly on advglue, with scores lagging far behind the benign accuracy. we hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. advglue is available at https://adversarialglue.github.io.

2021-11-03

Chen Zhang, João Sedoc, "Luis Fernando D'Haro", Rafael Banchs, Alexander Rudnicky
Abstract: the development of open-domain dialogue systems (ods)is a trending topic due to the large number of research challenges, large societal and business impact, and advances in the underlying technology. however, the development of these kinds of systems requires two important characteristics:1) automatic evaluation mechanisms that show high correlations with human judgements across multiple dialogue evaluation aspects (with explainable features for providing constructive and explicit feedback on the quality of generative models' responses for quick development and deployment)and 2) mechanisms that can help to control chatbot responses,while avoiding toxicity and employing intelligent ways to handle toxic user comments and keeping interaction flow and engagement. this track at the 10th dialogue system technology challenge (dstc10) is part of the ongoing effort to promote scalable and toxic-free ods. this paper describes the datasets and baselines provided to participants, as well as submission evaluation results for each of the two proposed subtasks.

2021-11-02

Carolyn Ashurst, Emmie Hine, Paul Sedille, Alexis Carlier
Abstract: ethics statements have been proposed as a mechanism to increase transparency and promote reflection on the societal impacts of published research. in 2020, the machine learning (ml) conference neurips broke new ground by requiring that all papers include a broader impact statement. this requirement was removed in 2021, in favour of a checklist approach. the 2020 statements therefore provide a unique opportunity to learn from the broader impact experiment: to investigate the benefits and challenges of this and similar governance mechanisms, as well as providing an insight into how ml researchers think about the societal impacts of their own work. such learning is needed as neurips and other venues continue to question and adapt their policies. to enable this, we have created a dataset containing the impact statements from all neurips 2020 papers, along with additional information such as affiliation type, location and subject area, and a simple visualisation tool for exploration. we also provide an initial quantitative analysis of the dataset, covering representation, engagement, common themes, and willingness to discuss potential harms alongside benefits. we investigate how these vary by geography, affiliation type and subject area. drawing on these findings, we discuss the potential benefits and negative outcomes of ethics statement requirements, and their possible causes and associated challenges. these lead us to several lessons to be learnt from the 2020 requirement: (i) the importance of creating the right incentives, (ii) the need for clear expectations and guidance, and (iii) the importance of transparency and constructive deliberation. we encourage other researchers to use our dataset to provide additional analysis, to further our understanding of how researchers responded to this requirement, and to investigate the benefits and challenges of this and related mechanisms.

2021-10-25

Anna Glazkova, Michael Kadantsev, Maksim Glazkov
Abstract: this paper describes neural models developed for the hate speech and offensive content identification in english and indo-aryan languages shared task 2021. our team called neuro-utmn-thales participated in two tasks on binary and fine-grained classification of english tweets that contain hate, offensive, and profane content (english subtasks a & b) and one task on identification of problematic content in marathi (marathi subtask a). for english subtasks, we investigate the impact of additional corpora for hate speech detection to fine-tune transformer models. we also apply a one-vs-rest approach based on twitter-roberta to discrimination between hate, profane and offensive posts. our models ranked third in english subtask a with the f1-score of 81.99% and ranked second in english subtask b with the f1-score of 65.77%. for the marathi tasks, we propose a system based on the language-agnostic bert sentence embedding (labse). this model achieved the second result in marathi subtask a obtaining an f1 of 88.08%.
Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li, Jacob Steinhardt
Abstract: when making everyday decisions, people are guided by their conscience, an internal sense of right and wrong. by contrast, artificial agents are currently not endowed with a moral sense. as a consequence, they may learn to behave immorally when trained on environments that ignore moral concerns, such as violent video games. with the advent of generally capable agents that pretrain on many environments, it will become necessary to mitigate inherited biases from environments that teach immoral behavior. to facilitate the development of agents that avoid causing wanton harm, we introduce jiminy cricket, an environment suite of 25 text-based adventure games with thousands of diverse, morally salient scenarios. by annotating every possible game state, the jiminy cricket environments robustly evaluate whether agents can act morally while maximizing reward. using models with commonsense moral knowledge, we create an elementary artificial conscience that assesses and guides agents. in extensive experiments, we find that the artificial conscience approach can steer agents towards moral behavior without sacrificing performance.

2021-10-24

Helen Ngo, João G. M. Araújo, Jeffrey Hui, Nicholas Frosst
Abstract: the one billion word benchmark is a dataset derived from the wmt 2011 news crawl, commonly used to measure language modeling ability in natural language processing. we train models solely on common crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. we suggest that the temporal nature of news and its distribution shift over time makes it poorly suited for measuring language modeling ability, and discuss potential impact and considerations for researchers building language models and evaluation datasets.

2021-10-21

Jakob Mokander, Jessica Morley, Mariarosaria Taddeo, Luciano Floridi
Abstract: important decisions that impact human lives, livelihoods, and the natural environment are increasingly being automated. delegating tasks to so-called automated decision-making systems (adms) can improve efficiency and enable new solutions. however, these benefits are coupled with ethical challenges. for example, adms may produce discriminatory outcomes, violate individual privacy, and undermine human self-determination. new governance mechanisms are thus needed that help organisations design and deploy adms in ways that are ethical, while enabling society to reap the full economic and social benefits of automation. in this article, we consider the feasibility and efficacy of ethics-based auditing (eba) as a governance mechanism that allows organisations to validate claims made about their adms. building on previous work, we define eba as a structured process whereby an entity's present or past behaviour is assessed for consistency with relevant principles or norms. we then offer three contributions to the existing literature. first, we provide a theoretical explanation of how eba can contribute to good governance by promoting procedural regularity and transparency. second, we propose seven criteria for how to design and implement eba procedures successfully. third, we identify and discuss the conceptual, technical, social, economic, organisational, and institutional constraints associated with eba. we conclude that eba should be considered an integral component of multifaced approaches to managing the ethical risks posed by adms.

2021-10-20

Josh Kalin, David Noever, Matthew Ciolino
Abstract: machine learning and software development share processes and methodologies for reliably delivering products to customers. this work proposes the use of a new teaming construct for forming machine learning teams for better combatting adversarial attackers. in cybersecurity, infrastructure uses these teams to protect their systems by using system builders and programmers to also offer more robustness to their platforms. color teams provide clear responsibility to the individuals on each team for which part of the baseline (yellow), attack (red), and defense (blue) breakout of the pipeline. combining colors leads to additional knowledge shared across the team and more robust models built during development. the responsibilities of the new teams orange, green, and purple will be outlined during this paper along with an overview of the necessary resources for these teams to be successful.

2021-10-19

Su Lin Blodgett, Michael Madaio
Abstract: if the authors of a recent stanford report (bommasani et al., 2021) on the opportunities and risks of "foundation models" are to be believed, these models represent a paradigm shift for ai and for the domains in which they will supposedly be used, including education. although the name is new (and contested (field, 2021)), the term describes existing types of algorithmic models that are "trained on broad data at scale" and "fine-tuned" (i.e., adapted) for particular downstream tasks, and is intended to encompass large language models such as bert or gpt-3 and computer vision models such as clip. such technologies have the potential for harm broadly speaking (e.g., bender et al., 2021), but their use in the educational domain is particularly fraught, despite the potential benefits for learners claimed by the authors. in section 3.3 of the stanford report, malik et al. argue that achieving the goal of providing education for all learners requires more efficient computational approaches that can rapidly scale across educational domains and across educational contexts, for which they argue foundation models are uniquely well-suited. however, evidence suggests that not only are foundation models not likely to achieve the stated benefits for learners, but their use may also introduce new risks for harm.

2021-10-18

Carles Sierra, Nardine Osman, Pablo Noriega, Jordi Sabater-Mir, Antoni Perelló
Abstract: principles that should govern autonomous ai systems. it essentially states that a system's goals and behaviour should be aligned with human values. but how to ensure value alignment? in this paper we first provide a formal model to represent values through preferences and ways to compute value aggregations; i.e. preferences with respect to a group of agents and/or preferences with respect to sets of values. value alignment is then defined, and computed, for a given norm with respect to a given value through the increase/decrease that it results in the preferences of future states of the world. we focus on norms as it is norms that govern behaviour, and as such, the alignment of a given system with a given value will be dictated by the norms the system follows.

2021-10-16

Hao Sun, Guangxuan Xu, Jiawen Deng, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, Minlie Huang
Abstract: dialogue safety problems severely limit the real-world deployment of neural conversational models and have attracted great research interests recently. however, dialogue safety problems remain under-defined and the corresponding dataset is scarce. we propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors in human-bot dialogue settings, with focuses on context-sensitive unsafety, which is under-explored in prior works. to spur research in this direction, we compile diasafety, a dataset with rich context-sensitive unsafe examples. experiments show that existing safety guarding tools fail severely on our dataset. as a remedy, we train a dialogue safety classifier to provide a strong baseline for context-sensitive dialogue unsafety detection. with our classifier, we perform safety evaluations on popular conversational models and show that existing dialogue systems still exhibit concerning context-sensitive safety problems.
Nicholas Meade, Elinor Poole-Dayan, Siva Reddy
Abstract: recent work has shown pre-trained language models capture social biases from the large amounts of text they are trained on. this has attracted attention to developing techniques that mitigate such biases. in this work, we perform an empirical survey of five recently proposed bias mitigation techniques: counterfactual data augmentation (cda), dropout, iterative nullspace projection, self-debias, and sentencedebias. we quantify the effectiveness of each technique using three intrinsic bias benchmarks while also measuring the impact of these techniques on a model's language modeling ability, as well as its performance on downstream nlu tasks. we experimentally find that: (1) self-debias is the strongest debiasing technique, obtaining improved scores on all bias benchmarks; (2) current debiasing techniques perform less consistently when mitigating non-gender biases; and (3) improvements on bias benchmarks such as stereoset and crows-pairs by using debiasing strategies are often accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective.

2021-10-15

Vijit Malik, Sunipa Dev, Akihiro Nishi, Nanyun Peng, Kai-Wei Chang
Abstract: language representations are efficient tools used across nlp applications, but they are strife with encoded societal biases. these biases are studied extensively, but with a primary focus on english language representations and biases common in the context of western society. in this work, we investigate biases present in hindi language representations with focuses on caste and religion-associated biases. we demonstrate how biases are unique to specific language representations based on the history and culture of the region they are widely spoken in, and how the same societal bias (such as binary gender-associated biases) is encoded by different words and text spans across languages. the discoveries of our work highlight the necessity of culture awareness and linguistic artifacts when modeling language representations, in order to better understand the encoded biases.
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, Samuel R. Bowman
Abstract: it is well documented that nlp models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (qa). we introduce the bias benchmark for qa (bbq), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for u.s. english-speaking contexts. our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. we find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.
Samuel R. Bowman
Abstract: researchers in nlp often frame and discuss research results in ways that serve to deemphasize the field's successes, often in response to the field's widespread hype. though well-meaning, this has yielded many misleading or false claims about the limits of our best technology. this is a problem, and it may be more serious than it looks: it harms our credibility in ways that can make it harder to mitigate present-day harms, like those involving biased systems for content moderation or resume screening. it also limits our ability to prepare for the potentially enormous impacts of more distant future advances. this paper urges researchers to be careful about these claims and suggests some research directions and communication strategies that will make it easier to avoid or rebut them.
Bingbing Li, Hongwu Peng, Rajat Sainju, Junhuan Yang, Lei Yang, Yueying Liang, Weiwen Jiang, Binghui Wang, Hang Liu, Caiwen Ding
Abstract: in this paper, we propose a novel gender bias detection method by utilizing attention map for transformer-based models. we 1) give an intuitive gender bias judgement method by comparing the different relation degree between the genders and the occupation according to the attention scores, 2) design a gender bias detector by modifying the attention module, 3) insert the gender bias detector into different positions of the model to present the internal gender bias flow, and 4) draw the consistent gender bias conclusion by scanning the entire wikipedia, a bert pretraining dataset. we observe that 1) the attention matrices, wq and wk introduce much more gender bias than other modules (including the embedding layer) and 2) the bias degree changes periodically inside of the model (attention matrix q, k, v, and the remaining part of the attention layer (including the fully-connected layer, the residual connection, and the layer normalization module) enhance the gender bias while the averaged attentions reduces the bias).

2021-10-14

Megan Ung, Jing Xu, Y-Lan Boureau
Abstract: current open-domain conversational models can easily be made to talk in inadequate ways. online learning from conversational feedback given by the conversation partner is a promising avenue for a model to improve and adapt, so as to generate fewer of these safety failures. however, current state-of-the-art models tend to react to feedback with defensive or oblivious responses. this makes for an unpleasant experience and may discourage conversation partners from giving feedback in the future. this work proposes saferdialogues, a task and dataset of graceful responses to conversational feedback about safety failures. we collect a dataset of 10k dialogues demonstrating safety failures, feedback signaling them, and a response acknowledging the feedback. we show how fine-tuning on this dataset results in conversations that human raters deem considerably more likely to lead to a civil conversation, without sacrificing engagingness or general conversational ability.
Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny Liang, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jon Borchardt, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi
Abstract: as ai systems become increasingly powerful and pervasive, there are growing concerns about machines' morality or a lack thereof. yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for ai. existing ai systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it. to explore this challenge, we introduce delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. empirical results shed novel insights on the promises and limits of machine ethics; delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense. yet, delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. despite that, we demonstrate positive use cases of imperfect delphi, including using it as a component model within other imperfect ai systems. importantly, we interpret the operationalization of delphi in light of prominent ethical theories, which leads us to important future research questions.

2021-10-13

Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders
Abstract: in many contexts, lying -- the use of verbal falsehoods to deceive -- is harmful. while lying has traditionally been a human affair, ai systems that make sophisticated verbal statements are becoming increasingly prevalent. this raises the question of how we should limit the harm caused by ai "lies" (i.e. falsehoods that are actively selected for). human truthfulness is governed by social norms and by laws (against defamation, perjury, and fraud). differences between ai and humans present an opportunity to have more precise standards of truthfulness for ai, and to have these standards rise over time. this could provide significant benefits to public epistemics and the economy, and mitigate risks of worst-case ai futures. establishing norms or laws of ai truthfulness will require significant work to: (1) identify clear truthfulness standards; (2) create institutions that can judge adherence to those standards; and (3) develop ai systems that are robustly truthful. our initial proposals for these areas include: (1) a standard of avoiding "negligent falsehoods" (a generalisation of lies that is easier to assess); (2) institutions to evaluate ai systems before and after real-world deployment; and (3) explicitly training ai systems to be truthful via curated datasets and human interaction. a concerning possibility is that evaluation mechanisms for eventual truthfulness standards could be captured by political interests, leading to harmful censorship and propaganda. avoiding this might take careful attention. and since the scale of ai speech acts might grow dramatically over the coming decades, early truthfulness standards might be particularly important because of the precedents they set.

2021-10-12

Md Abul Bashar, Richi Nayak, Anjor Kothare, Vishal Sharma, Kesavan Kandadai
Abstract: to create a more inclusive workplace, enterprises are actively investing in identifying and eliminating unconscious bias (e.g., gender, race, age, disability, elitism and religion) across their various functions. we propose a deep learning model with a transfer learning based language model to learn from manually tagged documents for automatically identifying bias in enterprise content. we first pretrain a deep learning-based language-model using wikipedia, then fine tune the model with a large unlabelled data set related with various types of enterprise content. finally, a linear layer followed by softmax layer is added at the end of the language model and the model is trained on a labelled bias dataset consisting of enterprise content. the trained model is thoroughly evaluated on independent datasets to ensure a general application. we present the proposed method and its deployment detail in a real-world application.

2021-10-11

Christopher Burr, David Leslie
Abstract: this article offers several contributions to the interdisciplinary project of responsible research and innovation in data science and ai. first, it provides a critical analysis of current efforts to establish practical mechanisms for algorithmic assessment, which are used to operationalise normative principles, such as sustainability, accountability, transparency, fairness, and explainability, in order to identify limitations and gaps with the current approaches. second, it provides an accessible introduction to the methodology of argument-based assurance, and explores how it is currently being applied in the development of safety cases for autonomous and intelligent systems. third, it generalises this method to incorporate wider ethical, social, and legal considerations, in turn establishing a novel version of argument-based assurance that we call 'ethical assurance'. ethical assurance is presented as a structured means for unifying the myriad practical mechanisms that have been proposed, as it is built upon a process-based form of project governance that supports inclusive and participatory ethical deliberation while also remaining grounded in social and technical realities. finally, it sets an agenda for ethical assurance, by detailing current challenges, open questions, and next steps, which serve as a springboard to build an active (and interdisciplinary) research programme as well as contribute to ongoing discussions in policy and governance.
Homanga Bharadhwaj
Abstract: robots of the future are going to exhibit increasingly human-like and super-human intelligence in a myriad of different tasks. they are also likely going to fail and be incompliant with human preferences in increasingly subtle ways. towards the goal of achieving autonomous robots, the robot learning community has made rapid strides in applying machine learning techniques to train robots through data and interaction. this makes the study of how best to audit these algorithms for checking their compatibility with humans, pertinent and urgent. in this paper, we draw inspiration from the ai safety and alignment communities and make the case that we need to urgently consider ways in which we can best audit our robot learning algorithms to check for failure modes, and ensure that when operating autonomously, they are indeed behaving in ways that the human algorithm designers intend them to. we believe that this is a challenging problem that will require efforts from the entire robot learning community, and do not attempt to provide a concrete framework for auditing. instead, we outline high-level guidance and a possible approach towards formulating this framework which we hope will serve as a useful starting point for thinking about auditing in the context of robot learning.

2021-10-08

Patrick Schramowski, Kristian Kersting
Abstract: probing or fine-tuning (large-scale) pre-trained models results in state-of-the-art performance for many nlp tasks and, more recently, even for computer vision tasks when combined with image data. unfortunately, these approaches also entail severe risks. in particular, large image datasets automatically scraped from the web may contain derogatory terms as categories and offensive images, and may also underrepresent specific classes. consequently, there is an urgent need to carefully document datasets and curate their content. unfortunately, this process is tedious and error-prone. we show that pre-trained transformers themselves provide a methodology for the automated curation of large-scale vision datasets. based on human-annotated examples and the implicit knowledge of a clip based model, we demonstrate that one can select relevant prompts for rating the offensiveness of an image. in addition to e.g. privacy violation and pornographic content previously identified in imagenet, we demonstrate that our approach identifies further inappropriate and potentially offensive content.
Chris Percy, Simo Dragicevic, Sanjoy Sarkar, "Artur S. D'Avila Garcez"
Abstract: recent ai-related scandals have shed a spotlight on accountability in ai, with increasing public interest and concern. this paper draws on literature from public policy and governance to make two contributions. first, we propose an ai accountability ecosystem as a useful lens on the system, with different stakeholders requiring and contributing to specific accountability mechanisms. we argue that the present ecosystem is unbalanced, with a need for improved transparency via ai explainability and adequate documentation and process formalisation to support internal audit, leading up eventually to external accreditation processes. second, we use a case study in the gambling sector to illustrate in a subset of the overall ecosystem the need for industry-specific accountability principles and processes. we define and evaluate critically the implementation of key accountability principles in the gambling industry, namely addressing algorithmic bias and model explainability, before concluding and discussing directions for future work based on our findings. keywords: accountability, explainable ai, algorithmic bias, regulation.

2021-10-05

Abeba Birhane, Vinay Uday Prabhu, Emmanuel Kahembwe
Abstract: we have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. the rise of these gargantuan datasets has given rise to formidable bodies of critical work that has called for caution while generating these large datasets. these address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the commoncrawl dataset often used as a source for training large language models, and the entrenched biases in large-scale visio-linguistic models (such as openai's clip model) trained on opaque datasets (webimagetext). in the backdrop of these specific calls of caution, we examine the recently released laion-400m dataset, which is a clip-filtered dataset of image-alt-text pairs parsed from the common-crawl dataset. we found that the dataset contains, troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. we outline numerous implications, concerns and downstream harms regarding the current state of large scale datasets while raising open questions for various stakeholders including the ai community, regulators, policy makers and data subjects.
Hoai Nam Tran, Udo Kruschwitz
Abstract: this paper describes our approach (ur-iw-hnt) for the shared task of germeval2021 to identify toxic, engaging, and fact-claiming comments. we submitted three runs using an ensembling strategy by majority (hard) voting with multiple different bert models of three different types: german-based, twitter-based, and multilingual models. all ensemble models outperform single models, while bertweet is the winner of all individual models in every subtask. twitter-based models perform better than germanbert models, and multilingual models perform worse but by a small margin.

2021-10-04

Neel Nanda, Jonathan Uesato, Sven Gowal
Abstract: collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. as such, it is often only possible to gather a small amount of high-quality labels. in this paper, we study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels (which exhibit bias against identity groups) for predicting toxicity of online comments. we evaluate the accuracy and fairness properties of these approaches, and trade-offs between the two. while we find that initial training on all of the data and fine-tuning on clean data produces models with the highest auc, we find that no single strategy performs best across all fairness metrics.

2021-10-03

Wenqian Ye, Fei Xu, Yaojia Huang, Cassie Huang, Ji A
Abstract: over the last few years, contextualized pre-trained neural language models, such as bert, gpt, have shown significant gains in various nlp tasks. to enhance the robustness of existing pre-trained models, one way is adversarial examples generation and evaluation for conducting data augmentation or adversarial learning. in the meanwhile, gender bias embedded in the models seems to be a serious problem in practical applications. many researches have covered the gender bias produced by word-level information(e.g. gender-stereotypical occupations), while few researchers have investigated the sentence-level cases and implicit cases. in this paper, we proposed a method to automatically generate implicit gender bias samples at sentence-level and a metric to measure gender bias. samples generated by our method will be evaluated in terms of accuracy. the metric will be used to guide the generation of examples from pre-trained models. therefore, those examples could be used to impose attacks on pre-trained models. finally, we discussed the evaluation efficacy of our generated examples on reducing gender bias for future research.

2021-10-01

Saad Hassan, Matt Huenerfauth, Cecilia Ovesdotter Alm
Abstract: much of the world's population experiences some form of disability during their lifetime. caution must be exercised while designing natural language processing (nlp) systems to prevent systems from inadvertently perpetuating ableist bias against people with disabilities, i.e., prejudice that favors those with typical abilities. we report on various analyses based on word predictions of a large-scale bert language model. statistically significant results demonstrate that people with disabilities can be disadvantaged. findings also explore overlapping forms of discrimination related to interconnected gender and race identities.
Richard J. Chen, Tiffany Y. Chen, Jana Lipkova, Judy J. Wang, Drew F. K. Williamson, Ming Y. Lu, Sharifa Sahai, Faisal Mahmood
Abstract: in the current development and deployment of many artificial intelligence (ai) systems in healthcare, algorithm fairness is a challenging problem in delivering equitable care. recent evaluation of ai models stratified across race sub-populations have revealed inequalities in how patients are diagnosed, given treatments, and billed for healthcare costs. in this perspective article, we summarize the intersectional field of fairness in machine learning through the context of current issues in healthcare, outline how algorithmic biases (e.g. - image acquisition, genetic variation, intra-observer labeling variability) arise in current clinical workflows and their resulting healthcare disparities. lastly, we also review emerging technology for mitigating bias via federated learning, disentanglement, and model explainability, and their role in ai-samd development.

2021-09-29

Moninder Singh, Gevorg Ghalachyan, Kush R. Varshney, Reginald E. Bryant
Abstract: to ensure trust in ai models, it is becoming increasingly apparent that evaluation of models must be extended beyond traditional performance metrics, like accuracy, to other dimensions, such as fairness, explainability, adversarial robustness, and distribution shift. we describe an empirical study to evaluate multiple model types on various metrics along these dimensions on several datasets. our results show that no particular model type performs well on all dimensions, and demonstrate the kinds of trade-offs involved in selecting models evaluated along multiple dimensions.

2021-09-28

Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
Abstract: machine learning (ml) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. as with other powerful technologies, safety for ml should be a leading research priority. in response to emerging safety challenges in ml, such as those introduced by recent large-scale models, we provide a new roadmap for ml safety and refine the technical problems that the field needs to address. we present four problems ready for research, namely withstanding hazards ("robustness"), identifying hazards ("monitoring"), reducing inherent model hazards ("alignment"), and reducing systemic hazards ("systemic safety"). throughout, we clarify each problem's motivation and provide concrete research directions.

2021-09-27

Matan Halevy, Camille Harris, Amy Bruckman, Diyi Yang, Ayanna Howard
Abstract: recent research has demonstrated how racial biases against users who write african american english exists in popular toxic language datasets. while previous work has focused on a single fairness criteria, we propose to use additional descriptive fairness metrics to better understand the source of these biases. we demonstrate that different benchmark classifiers, as well as two in-process bias-remediation techniques, propagate racial biases even in a larger corpus. we then propose a novel ensemble-framework that uses a specialized classifier that is fine-tuned to the african american english dialect. we show that our proposed framework substantially reduces the racial biases that the model learns from these datasets. we demonstrate how the ensemble framework improves fairness metrics across all sample datasets with minimal impact on the classification performance, and provide empirical evidence in its ability to unlearn the annotation biases towards authors who use african american english. ** please note that this work may contain examples of offensive words and phrases.

2021-09-23

Ángel Alexander Cabrera, Abraham J. Druck, Jason I. Hong, Adam Perer
Abstract: ai systems can fail to learn important behaviors, leading to real-world issues like safety concerns and biases. discovering these systematic failures often requires significant developer attention, from hypothesizing potential edge cases to collecting evidence and validating patterns. to scale and streamline this process, we introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect ai errors. we also design and implement deblinder, a visual analytics system for synthesizing failure reports that developers can use to discover and validate systematic failures. in semi-structured interviews and think-aloud studies with 10 ai practitioners, we explore the affordances of the deblinder system and the applicability of failure reports in real-world settings. lastly, we show how collecting additional data from the groups identified by developers can improve model performance.
Zexue He, Bodhisattwa Prasad Majumder, Julian Mcauley
Abstract: written language carries explicit and implicit biases that can distract from meaningful signals. for example, letters of reference may describe male and female candidates differently, or their writing style may indirectly reveal demographic characteristics. at best, such biases distract from the meaningful content of the text; at worst they can lead to unfair outcomes. we investigate the challenge of re-generating input sentences to 'neutralize' sensitive attributes while maintaining the semantic meaning of the original text (e.g. is the candidate qualified?). we propose a gradient-based rewriting framework, detect and perturb to neutralize (depen), that first detects sensitive components and masks them for regeneration, then perturbs the generation model at decoding time under a neutralizing constraint that pushes the (predicted) distribution of sensitive attributes towards a uniform distribution. our experiments in two different scenarios show that depen can regenerate fluent alternatives that are neutral in the sensitive attribute while maintaining the semantics of other attributes.

2021-09-21

Shivashankar Subramanian, Xudong Han, Timothy Baldwin, Trevor Cohn, Lea Frermann
Abstract: bias is pervasive in nlp models, motivating the development of automatic debiasing techniques. evaluation of nlp debiasing methods has largely been limited to binary attributes in isolation, e.g., debiasing with respect to binary gender or race, however many corpora involve multiple such attributes, possibly with higher cardinality. in this paper we argue that a truly fair model must consider `gerrymandering' groups which comprise not only single attributes, but also intersectional groups. we evaluate a form of bias-constrained model which is new to nlp, as well an extension of the iterative nullspace projection technique which can handle multiple protected attributes.
N/A Ashwin, William Agnew, Umut Pajaro, Hetvi Jethwani, Arjun Subramonian
Abstract: trustworthy artificial intelligence (ai) has become an important topic because trust in ai systems and their creators has been lost. researchers, corporations, and governments have long and painful histories of excluding marginalized groups from technology development, deployment, and oversight. as a result, these technologies are less useful and even harmful to minoritized groups. we argue that any ai development, deployment, and monitoring framework that aspires to trust must incorporate both feminist, non-exploitative participatory design principles and strong, outside, and continual monitoring and testing. we additionally explain the importance of considering aspects of trustworthiness beyond just transparency, fairness, and accountability, specifically, to consider justice and shifting power to the disempowered as core values to any trustworthy ai system. creating trustworthy ai starts by funding, supporting, and empowering grassroots organizations like queer in ai so the field of ai has the diversity and inclusion to credibly and effectively develop trustworthy ai. we leverage the expert knowledge queer in ai has developed through its years of work and advocacy to discuss if and how gender, sexuality, and other aspects of queer identity should be used in datasets and ai systems and how harms along these lines should be mitigated. based on this, we share a gendered approach to ai and further propose a queer epistemology and analyze the benefits it can bring to ai. we additionally discuss how to regulate ai with this queer epistemology in vision, proposing frameworks for making policies related to ai & gender diversity and privacy & queer data protection.

2021-09-20

Subbarao Kambhampati, Sarath Sreedharan, Mudit Verma, Yantian Zha, Lin Guan
Abstract: despite the surprising power of many modern ai systems that often learn their own representations, there is significant discontent about their inscrutability and the attendant problems in their ability to interact with humans. while alternatives such as neuro-symbolic approaches have been proposed, there is a lack of consensus on what they are about. there are often two independent motivations (i) symbols as a lingua franca for human-ai interaction and (ii) symbols as system-produced abstractions used by the ai system in its internal reasoning. the jury is still out on whether ai systems will need to use symbols in their internal reasoning to achieve general intelligence capabilities. whatever the answer there is, the need for (human-understandable) symbols in human-ai interaction seems quite compelling. symbols, like emotions, may well not be sine qua non for intelligence per se, but they will be crucial for ai systems to interact with us humans -- as we can neither turn off our emotions nor get by without our symbols. in particular, in many human-designed domains, humans would be interested in providing explicit (symbolic) knowledge and advice -- and expect machine explanations in kind. this alone requires ai systems to to maintain a symbolic interface for interaction with humans. in this blue sky paper, we argue this point of view, and discuss research directions that need to be pursued to allow for this type of human-ai interaction.
Varun Chandrasekaran, Hengrui Jia, Anvith Thudi, Adelin Travers, Mohammad Yaghini, Nicolas Papernot
Abstract: the application of machine learning (ml) in computer systems introduces not only many benefits but also risks to society. in this paper, we develop the concept of ml governance to balance such benefits and risks, with the aim of achieving responsible applications of ml. our approach first systematizes research towards ascertaining ownership of data and models, thus fostering a notion of identity specific to ml systems. building on this foundation, we use identities to hold principals accountable for failures of ml systems through both attribution and auditing. to increase trust in ml systems, we then survey techniques for developing assurance, i.e., confidence that the system meets its security requirements and does not exhibit certain known failures. this leads us to highlight the need for techniques that allow a model owner to manage the life cycle of their system, e.g., to patch or retire their ml system. put altogether, our systematization of knowledge standardizes the interactions between principals involved in the deployment of ml throughout its life cycle. we highlight opportunities for future work, e.g., to formalize the resulting game between ml principals.

2021-09-18

David Dale, Anton Voronov, Daryna Dementieva, Varvara Logacheva, Olga Kozlova, Nikita Semenov, Alexander Panchenko
Abstract: we present two novel unsupervised methods for eliminating toxicity in text. our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. we use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. our second method uses bert to replace toxic words with their non-offensive synonyms. we make the method more flexible by enabling bert to replace mask tokens with a variable number of words. finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. we compare our models with a number of methods for style transfer. the models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. both methods we suggest yield new sota results.

2021-09-17

Fei Tan, Yifan Hu, Kevin Yen, Changwei Hu
Abstract: text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. in this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. specifically, we propose a new concept {\it text toxicity propensity} to characterize the extent to which a text tends to attract toxic comments. beta regression is then introduced to do the probabilistic modeling, which is demonstrated to function well in comprehensive experiments. we also propose an explanation method to communicate the model decision clearly. both propensity scoring and interpretation benefit text moderation in a novel manner. finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.

2021-09-16

Xudong Han, Timothy Baldwin, Trevor Cohn
Abstract: group bias in natural language processing tasks manifests as disparities in system error rates across texts authorized by different demographic groups, typically disadvantaging minority groups. dataset balancing has been shown to be effective at mitigating bias, however existing approaches do not directly account for correlations between author demographics and linguistic variables, limiting their effectiveness. to achieve equal opportunity fairness, such as equal job opportunity without regard to demographics, this paper introduces a simple, but highly effective, objective for countering bias using balanced training. we extend the method in the form of a gated model, which incorporates protected attributes as input, and show that it is effective at reducing bias in predictions through demographic input perturbation, outperforming all other bias mitigation techniques when combined with balanced training.

2021-09-15

Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, Po-Sen Huang
Abstract: large language models (lm) generate remarkably fluent text and can be efficiently adapted across nlp tasks. measuring and guaranteeing the quality of generated text in terms of safety is imperative for deploying lms in the real world; to this end, prior work often relies on automatic evaluation of lm toxicity. we critically discuss this approach, evaluate several toxicity mitigation strategies with respect to both automatic and human evaluation, and analyze consequences of toxicity mitigation in terms of model bias and lm quality. we demonstrate that while basic intervention strategies can effectively optimize previously established automatic metrics on the realtoxicityprompts dataset, this comes at the cost of reduced lm coverage for both texts about, and dialects of, marginalized groups. additionally, we find that human raters often disagree with high automatic toxicity scores after strong toxicity reduction interventions -- highlighting further the nuances involved in careful evaluation of lm toxicity.

2021-09-14

Tenghao Huang, Faeze Brahman, Vered Shwartz, Snigdha Chaturvedi
Abstract: pre-trained language models learn socially harmful biases from their training corpora, and may repeat these biases when used for generation. we study gender biases associated with the protagonist in model-generated stories. such biases may be expressed either explicitly ("women can't park") or implicitly (e.g. an unsolicited male character guides her into a parking space). we focus on implicit biases, and use a commonsense reasoning engine to uncover them. specifically, we infer and analyze the protagonist's motivations, attributes, mental states, and implications on others. our findings regarding implicit biases are in line with prior work that studied explicit biases, for example showing that female characters' portrayal is centered around appearance, while male figures' focus on intellect.
Dian Yu, Kenji Sagae
Abstract: neural dialog models are known to suffer from problems such as generating unsafe and inconsistent responses. even though these problems are crucial and prevalent, they are mostly manually identified by model designers through interactions. recently, some research instructs crowdworkers to goad the bots into triggering such problems. however, humans leverage superficial clues such as hate speech, while leaving systematic problems undercover. in this paper, we propose two methods including reinforcement learning to automatically trigger a dialog model into generating problematic responses. we show the effect of our methods in exposing safety and contradiction issues with state-of-the-art dialog models.

2021-09-13

Jaimeen Ahn, Alice Oh
Abstract: bert and other large-scale language models (lms) contain gender and racial bias. they also exhibit other dimensions of social bias, most of which have not been studied in depth, and some of which vary depending on the language. in this paper, we study ethnic bias and how it varies across languages by analyzing and mitigating ethnic bias in monolingual bert for english, german, spanish, korean, turkish, and chinese. to observe and quantify ethnic bias, we develop a novel metric called categorical bias score. then we propose two methods for mitigation; first using a multilingual model, and second using contextual word alignment of two monolingual models. we compare our proposed methods with monolingual bert and show that these methods effectively alleviate the ethnic bias. which of the two methods works better depends on the amount of nlp resources available for that language. we additionally experiment with arabic and greek to verify that our proposed methods work for a wider variety of languages.

2021-09-12

Abigail Z. Jacobs
Abstract: measurement of social phenomena is everywhere, unavoidably, in sociotechnical systems. this is not (only) an academic point: fairness-related harms emerge when there is a mismatch in the measurement process between the thing we purport to be measuring and the thing we actually measure. however, the measurement process -- where social, cultural, and political values are implicitly encoded in sociotechnical systems -- is almost always obscured. furthermore, this obscured process is where important governance decisions are encoded: governance about which systems are fair, which individuals belong in which categories, and so on. we can then use the language of measurement, and the tools of construct validity and reliability, to uncover hidden governance decisions. in particular, we highlight two types of construct validity, content validity and consequential validity, that are useful to elicit and characterize the feedback loops between the measurement, social construction, and enforcement of social categories. we then explore the constructs of fairness, robustness, and responsibility in the context of governance in and for responsible ai. together, these perspectives help us unpack how measurement acts as a hidden governance process in sociotechnical systems. understanding measurement as governance supports a richer understanding of the governance processes already happening in ai -- responsible or otherwise -- revealing paths to more effective interventions.

2021-09-10

Diptanu Sarkar, Marcos Zampieri, Tharindu Ranasinghe, Alexander Ororbia
Abstract: transformer-based models such as bert, xlnet, and xlm-r have achieved state-of-the-art performance across various nlp tasks including the identification of offensive language and hate speech, an important problem in social media. in this paper, we present fbert, a bert model retrained on solid, the largest english offensive language identification corpus available with over $1.4$ million offensive instances. we evaluate fbert's performance on identifying offensive content on multiple english datasets and we test several thresholds for selecting instances from solid. the fbert model will be made freely available to the community.

2021-09-09

Michael Mendelson, Yonatan Belinkov
Abstract: model robustness to bias is often determined by the generalization on carefully designed out-of-distribution datasets. recent debiasing methods in natural language understanding (nlu) improve performance on such datasets by pressuring models into making unbiased predictions. an underlying assumption behind such methods is that this also leads to the discovery of more robust features in the model's inner representations. we propose a general probing-based framework that allows for post-hoc interpretation of biases in language models, and use an information-theoretic approach to measure the extractability of certain biases from the model's representations. we experiment with several nlu datasets and known biases, and show that, counter-intuitively, the more a language model is pushed towards a debiased regime, the more bias is actually encoded in its inner representations.
Francesco Sovrano, Fabio Vitali, Monica Palmirani
Abstract: through the general data protection regulation (gdpr), the european union has set out its vision for automated decision- making (adm) and ai, which must be reliable and human-centred. in particular we are interested on the right to explanation, that requires industry to produce explanations of adm. the high-level expert group on artificial intelligence (ai-hleg), set up to support the implementation of this vision, has produced guidelines discussing the types of explanations that are appropriate for user-centred (interactive) explanatory tools. in this paper we propose our version of explanatory narratives (en), based on user-centred concepts drawn from iso 9241, as a model for user-centred explanations aligned with the gdpr and the ai-hleg guidelines. through the use of ens we convert the problem of generating explanations for adm into the identification of an appropriate path over an explanatory space, allowing explainees to interactively explore it and produce the explanation best suited to their needs. to this end we list suitable exploration heuristics, we study the properties and structure of explanations, and discuss the proposed model identifying its weaknesses and strengths.

2021-09-08

Anne Lauscher, Tobias Lüken, Goran Glavaš
Abstract: unfair stereotypical biases (e.g., gender, racial, or religious biases) encoded in modern pretrained language models (plms) have negative ethical implications for widespread adoption of state-of-the-art language technology. to remedy for this, a wide range of debiasing techniques have recently been introduced to remove such stereotypical biases from plms. existing debiasing methods, however, directly modify all of the plms parameters, which -- besides being computationally expensive -- comes with the inherent risk of (catastrophic) forgetting of useful language knowledge acquired in pretraining. in this work, we propose a more sustainable modular debiasing approach based on dedicated debiasing adapters, dubbed adele. concretely, we (1) inject adapter modules into the original plm layers and (2) update only the adapters (i.e., we keep the original plm parameters frozen) via language modeling training on a counterfactually augmented corpus. we showcase adele, in gender debiasing of bert: our extensive evaluation, encompassing three intrinsic and two extrinsic bias measures, renders adele, very effective in bias mitigation. we further show that -- due to its modular nature -- adele, coupled with task adapters, retains fairness even after large-scale downstream training. finally, by means of multilingual bert, we successfully transfer adele, to six target languages.
Shahar Levy, Koren Lazar, Gabriel Stanovsky
Abstract: recent works have found evidence of gender bias in models of machine translation and coreference resolution using mostly synthetic diagnostic datasets. while these quantify bias in a controlled experiment, they often do so on a small scale and consist mostly of artificial, out-of-distribution sentences. in this work, we find grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments (e.g., female nurses versus male dancers) in corpora from three domains, resulting in a first large-scale gender bias dataset of 108k diverse real-world english sentences. we manually verify the quality of our corpus and use it to evaluate gender bias in various coreference resolution and machine translation models. we find that all tested models tend to over-rely on gender stereotypes when presented with natural inputs, which may be especially harmful when deployed in commercial systems. finally, we show that our dataset lends itself to finetuning a coreference resolution model, finding it mitigates bias on a held out set. our dataset and models are publicly available at www.github.com/slab-nlp/bug. we hope they will spur future research into gender bias evaluation mitigation techniques in realistic settings.
Stephanie Lin, Jacob Hilton, Owain Evans
Abstract: we propose a benchmark to measure whether a language model is truthful in generating answers to questions. the benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. we crafted questions that some humans would answer falsely due to a false belief or misconception. to perform well, models must avoid generating false answers learned from imitating human texts. we tested gpt-3, gpt-neo/j, gpt-2 and a t5-based model. the best model was truthful on 58% of questions, while human performance was 94%. models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. the largest models were generally the least truthful. this contrasts with other nlp tasks, where performance improves with model size. however, this result is expected if false answers are learned from the training distribution. we suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

2021-09-07

Thomas P Quinn, Simon Coghlan
Abstract: medical students will almost inevitably encounter powerful medical ai systems early in their careers. yet, contemporary medical education does not adequately equip students with the basic clinical proficiency in medical ai needed to use these tools safely and effectively. education reform is urgently needed, but not easily implemented, largely due to an already jam-packed medical curricula. in this article, we propose an education reform framework as an effective and efficient solution, which we call the embedded ai ethics education framework. unlike other calls for education reform to accommodate ai teaching that are more radical in scope, our framework is modest and incremental. it leverages existing bioethics or medical ethics curricula to develop and deliver content on the ethical issues associated with medical ai, especially the harms of technology misuse, disuse, and abuse that affect the risk-benefit analyses at the heart of healthcare. in doing so, the framework provides a simple tool for going beyond the "what?" and the "why?" of medical ai ethics education, to answer the "how?", giving universities, course directors, and/or professors a broad road-map for equipping their students with the necessary clinical proficiency in medical ai.
Mudit Chaudhary, Chandni Saxena, Helen Meng
Abstract: online hate speech has caught everyone's attention from the news related to the covid-19 pandemic, us elections, and worldwide protests. online toxicity - an umbrella term for online hateful behavior, manifests itself in forms such as online hate speech. hate speech is a deliberate attack directed towards an individual or a group motivated by the targeted entity's identity or opinions. the rising mass communication through social media further exacerbates the harmful consequences of online hate speech. while there has been significant research on hate-speech identification using natural language processing (nlp), the work on utilizing nlp for prevention and intervention of online hate speech lacks relatively. this paper presents a holistic conceptual framework on hate-speech nlp countering methods along with a thorough survey on the current progress of nlp for countering online hate speech. it classifies the countering techniques based on their time of action, and identifies potential future research areas on this topic.
Michaela Hardt, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick Mccarthy, Ashish Rathi, Scott Rees, Ankit Siva, Erhyuan Tsai, Keerthan Vasist, Pinar Yilmaz, Muhammad Bilal Zafar, Sanjiv Das, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi
Abstract: understanding the predictions made by machine learning (ml) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. we present amazon sagemaker clarify, an explainability feature for amazon sagemaker that launched in december 2020, providing insights into data and ml models by identifying biases and explaining predictions. it is deeply integrated into amazon sagemaker, a fully managed service that enables data scientists and developers to build, train, and deploy ml models at any scale. clarify supports bias detection and feature importance computation across the ml lifecycle, during data preparation, model evaluation, and post-deployment monitoring. we outline the desiderata derived from customer input, the modular architecture, and the methodology for bias and explanation computations. further, we describe the technical challenges encountered and the tradeoffs we had to make. for illustration, we discuss two customer use cases. we present our deployment results including qualitative customer feedback and a quantitative evaluation. finally, we summarize lessons learned, and discuss best practices for the successful adoption of fairness and explanation tools in practice.
Eric Michael Smith, Adina Williams
Abstract: all ai models are susceptible to learning biases in data that they are trained on. for generative dialogue models, being trained on real human conversations containing unbalanced gender and race/ethnicity references can lead to models that display learned biases, which we define here broadly as any measurable differences in the distributions of words or semantic content of conversations based on demographic groups. we measure the strength of such biases by producing artificial conversations between two copies of a dialogue model, conditioning one conversational partner to state a name commonly associated with a certain gender and/or race/ethnicity. we find that larger capacity models tend to exhibit more gender bias and greater stereotyping of occupations by gender. we show that several methods of tuning these dialogue models, specifically name scrambling, controlled generation, and unlikelihood training, are effective in reducing bias in conversation, including on a downstream conversational task. name scrambling is also effective in lowering differences in token usage across conversations where partners have names associated with different genders or races/ethnicities.

2021-09-05

Abbas Ghaddar, Philippe Langlais, Mehdi Rezagholizadeh, Ahmad Rashid
Abstract: existing natural language understanding (nlu) models have been shown to incorporate dataset biases leading to strong performance on in-distribution (id) test sets but poor performance on out-of-distribution (ood) ones. we introduce a simple yet effective debiasing framework whereby the shallow representations of the main model are used to derive a bias model and both models are trained simultaneously. we demonstrate on three well studied nlu tasks that despite its simplicity, our method leads to competitive ood results. it significantly outperforms other debiasing approaches on two tasks, while still delivering high in-distribution performance.
Shiri Dori-Hacohen, Roberto Montenegro, Fabricio Murai, Scott A. Hale, Keen Sung, Michela Blain, Jennifer Edwards-Johnson
Abstract: most fairness in ai research focuses on exposing biases in ai systems. a broader lens on fairness reveals that ai can serve a greater aspiration: rooting out societal inequities from their source. specifically, we focus on inequities in health information, and aim to reduce bias in that domain using ai. the ai algorithms under the hood of search engines and social media, many of which are based on recommender systems, have an outsized impact on the quality of medical and health information online. therefore, embedding bias detection and reduction into these recommender systems serving up medical and health content online could have an outsized positive impact on patient outcomes and wellbeing. in this position paper, we offer the following contributions: (1) we propose a novel framework of fairness via ai, inspired by insights from medical education, sociology and antiracism; (2) we define a new term, bisinformation, which is related to, but distinct from, misinformation, and encourage researchers to study it; (3) we propose using ai to study, detect and mitigate biased, harmful, and/or false health information that disproportionately hurts minority groups in society; and (4) we suggest several pillars and pose several open problems in order to seed inquiry in this new space. while part (3) of this work specifically focuses on the health domain, the fundamental computer science advances and contributions stemming from research efforts in bias reduction and fairness via ai have broad implications in all areas of society.

2021-09-01

Tomer Wullach, Amir Adler, Einat Minkov
Abstract: automatic hate speech detection is hampered by the scarcity of labeled datasetd, leading to poor generalization. we employ pretrained language models (lms) to alleviate this data bottleneck. we utilize the gpt lm for generating large amounts of synthetic hate speech sequences from available labeled examples, and leverage the generated data in fine-tuning large pretrained lms on hate detection. an empirical study using the models of bert, roberta and albert, shows that this approach improves generalization significantly and consistently within and across data distributions. in fact, we find that generating relevant labeled hate speech sequences is preferable to using out-of-domain, and sometimes also within-domain, human-labeled examples.

2021-08-31

Mattias Wahde, Marco Virgolin
Abstract: in this position paper, we present five key principles, namely interpretability, inherent capability to explain, independent data, interactive learning, and inquisitiveness, for the development of conversational ai that, unlike the currently popular black box approaches, is transparent and accountable. at present, there is a growing concern with the use of black box statistical language models: while displaying impressive average performance, such systems are also prone to occasional spectacular failures, for which there is no clear remedy. in an effort to initiate a discussion on possible alternatives, we outline and exemplify how our five principles enable the development of conversational ai systems that are transparent and thus safer for use. we also present some of the challenges inherent in the implementation of those principles.
Maxine Major, Brian Souza, Joseph Divita, Kimberly Ferguson-Walter
Abstract: the performance of artificial intelligence (ai) algorithms in practice depends on the realism and correctness of the data, models, and feedback (labels or rewards) provided to the algorithm. this paper discusses methods for improving the realism and ecological validity of ai used for autonomous cyber defense by exploring the potential to use inverse reinforcement learning (irl) to gain insight into attacker actions, utilities of those actions, and ultimately decision points which cyber deception could thwart. the tularosa study, as one example, provides experimental data of real-world techniques and tools commonly used by attackers, from which core data vectors can be leveraged to inform an autonomous cyber defense system.

2021-08-29

Weiyan Shi, Aiqi Cui, Evan Li, Ruoxi Jia, Zhou Yu
Abstract: with the increasing applications of language models, it has become crucial to protect these models from leaking private information. previous work has attempted to tackle this challenge by training rnn-based language models with differential privacy guarantees. however, applying classical differential privacy to language models leads to poor model performance as the underlying privacy notion is over-pessimistic and provides undifferentiated protection for all tokens in the data. given that the private information in natural language is sparse (for example, the bulk of an email might not carry personally identifiable information), we propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data to improve model utility. to realize such a new notion, we develop a corresponding privacy mechanism, selective-dpsgd, for rnn-based language models. besides language modeling, we also apply the method to a more concrete application--dialog systems. experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utilities while remaining safe under various privacy attacks compared to the baselines. the data and code are released at https://github.com/wyshi/lm_privacy to facilitate future research .

2021-08-28

Jess Whittlestone, Jack Clark
Abstract: in this paper we outline a proposal for improving the governance of artificial intelligence (ai) by investing in government capacity to systematically measure and monitor the capabilities and impacts of ai systems. if adopted, this would give governments greater information about the ai ecosystem, equipping them to more effectively direct ai development and deployment in the most societally and economically beneficial directions. it would also create infrastructure that could rapidly identify potential threats or harms that could occur as a consequence of changes in the ai ecosystem, such as the emergence of strategically transformative capabilities, or the deployment of harmful systems. we begin by outlining the problem which motivates this proposal: in brief, traditional governance approaches struggle to keep pace with the speed of progress in ai. we then present our proposal for addressing this problem: governments must invest in measurement and monitoring infrastructure. we discuss this proposal in detail, outlining what specific things governments could focus on measuring and monitoring, and the kinds of benefits this would generate for policymaking. finally, we outline some potential pilot projects and some considerations for implementing this in practice.

2021-08-26

Ashutosh Baheti, Maarten Sap, Alan Ritter, Mark Riedl
Abstract: dialogue models trained on human conversations inadvertently learn to generate toxic responses. in addition to producing explicitly offensive utterances, these models can also implicitly insult a group or individual by aligning themselves with an offensive statement. to better understand the dynamics of contextually offensive language, we investigate the stance of dialogue model responses in offensive reddit conversations. specifically, we create toxichat, a crowd-annotated dataset of 2,000 reddit threads and model responses labeled with offensive language and stance. our analysis reveals that 42% of human responses agree with toxic comments, whereas only 13% agree with safe comments. this undesirable behavior is learned by neural dialogue models, such as dialogpt, which we show are two times more likely to agree with offensive comments. to enable automatic detection of offensive language, we fine-tuned transformer-based classifiers on toxichat that achieve 0.71 f1 for offensive labels and 0.53 macro-f1 for stance labels. finally, we quantify the effectiveness of controllable text generation (ctg) methods to mitigate the tendency of neural dialogue models to agree with offensive comments. compared to the baseline, our best ctg model achieves a 19% reduction in agreement with offensive comments and produces 29% fewer offensive replies. our work highlights the need for further efforts to characterize and analyze inappropriate behavior in dialogue models, in order to help make them safer. our code and corpus are available at https://github.com/abaheti95/toxichat .
Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff M Phillips, Kai-Wei Chang
Abstract: gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. however, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. these harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. in this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in english language technologies. we also detail how current language representations (e.g., glove, bert) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.

2021-08-24

Andrea Madotto
Abstract: this thesis investigates the controllability of deep learning-based, end-to-end, generative dialogue systems in both task-oriented and chit-chat scenarios. in particular, we study the different aspects of controlling generative dialogue systems, including controlling styles and topics and continuously adding and combining dialogue skills. in the three decades since the first dialogue system was commercialized, the basic architecture of such systems has remained substantially unchanged, consisting of four pipelined basic components, namely, natural language understanding (nlu), dialogue state tracking (dst), a dialogue manager (dm) and natural language generation (nlg). the dialogue manager, which is the critical component of the modularized system, controls the response content and style. this module is usually programmed by rules and is designed to be highly controllable and easily extendable. with the emergence of powerful "deep learning" architectures, end-to-end generative dialogue systems have been proposed to optimize overall system performance and simplify training. however, these systems cannot be easily controlled and extended as the modularized dialogue manager can. this is because a single neural system is used, which is usually a large pre-trained language model (e.g., gpt-2), and thus it is hard to surgically change desirable attributes (e.g., style, topics, etc.). more importantly, uncontrollable dialogue systems can generate offensive and even toxic responses. therefore, in this thesis, we study controllable methods for end-to-end generative dialogue systems in task-oriented and chit-chat scenarios. throughout the chapters, we describe 1) how to control the style and topics of chit-chat models, 2) how to continuously control and extend task-oriented dialogue systems, and 3) how to compose and control multi-skill dialogue models.

2021-08-23

Isabelle Augenstein
Abstract: the past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. this development has spurred research in the area of automatic fact checking, from approaches to detect check-worthy claims and determining the stance of tweets towards claims, to methods to determine the veracity of claims given evidence documents. these automatic methods are often content-based, using natural language processing methods, which in turn utilise deep neural networks to learn higher-order features from text in order to make predictions. as deep neural networks are black-box models, their inner workings cannot be easily explained. at the same time, it is desirable to explain how they arrive at certain decisions, especially if they are to be used for decision making. while this has been known for some time, the issues this raises have been exacerbated by models increasing in size, and by eu legislation requiring models to be used for decision making to provide explanations, and, very recently, by legislation requiring online platforms operating in the eu to provide transparent reporting on their services. despite this, current solutions for explainability are still lacking in the area of fact checking. this thesis presents my research on automatic fact checking, including claim check-worthiness detection, stance detection and veracity prediction. its contributions go beyond fact checking, with the thesis proposing more general machine learning solutions for natural language processing in the area of learning with limited labelled data. finally, the thesis presents some first solutions for explainable fact checking.

2021-08-20

Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri
Abstract: there is burgeoning interest in designing ai-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. the most notable of these comes in the form of the first self-described `ai pair programmer', github copilot, a language model trained over open-source github code. however, code often contains bugs - and so, given the vast quantity of unvetted code that copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. this raises concerns on the security of copilot's code contributions. in this work, we systematically investigate the prevalence and conditions that can cause github copilot to recommend insecure code. to perform this analysis we prompt copilot to generate code in scenarios relevant to high-risk cwes (e.g. those from mitre's "top 25" list). we explore copilot's performance on three distinct code generation axes -- examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. in total, we produce 89 different scenarios for copilot to complete, producing 1,689 programs. of these, we found approximately 40% to be vulnerable.
Paolo Bova, Jonas Emanuel Müller, Benjamin Harack
Abstract: society could soon see transformative artificial intelligence (tai). models of competition for tai show firms face strong competitive pressure to deploy tai systems before they are safe. this paper explores a proposed solution to this problem, a windfall clause, where developers commit to donating a significant portion of any eventual extremely large profits to good causes. however, a key challenge for a windfall clause is that firms must have reason to join one. firms must also believe these commitments are credible. we extend a model of tai competition with a windfall clause to show how firms and policymakers can design a windfall clause which overcomes these challenges. encouragingly, firms benefit from joining a windfall clause under a wide range of scenarios. we also find that firms join the windfall clause more often when the competition is more dangerous. even when firms learn each other's capabilities, firms rarely wish to withdraw their support for the windfall clause. these three findings strengthen the case for using a windfall clause to promote the safe development of tai.

2021-08-16

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney Von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, Percy Liang
Abstract: ai is undergoing a paradigm shift with the rise of models (e.g., bert, dall-e, gpt-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. we call these models foundation models to underscore their critically central yet incomplete character. this report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles(e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities,and their effectiveness across so many tasks incentivizes homogenization. homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. to tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

2021-08-14

Ayush Kumar, Pratik Kumar
Abstract: with surge in online platforms, there has been an upsurge in the user engagement on these platforms via comments and reactions. a large portion of such textual comments are abusive, rude and offensive to the audience. with machine learning systems in-place to check such comments coming onto platform, biases present in the training data gets passed onto the classifier leading to discrimination against a set of classes, religion and gender. in this work, we evaluate different classifiers and feature to estimate the bias in these classifiers along with their performance on downstream task of toxicity classification. results show that improvement in performance of automatic toxic comment detection models is positively correlated to mitigating biases in these models. in our work, lstm with attention mechanism proved to be a better modelling strategy than a cnn model. further analysis shows that fasttext embeddings is marginally preferable than glove embeddings on training models for toxicity comment detection. deeper analysis reveals the findings that such automatic models are particularly biased to specific identity groups even though the model has a high auc score. finally, in effort to mitigate bias in toxicity detection models, a multi-task setup trained with auxiliary task of toxicity sub-types proved to be useful leading to upto 0.26% (6% relative) gain in auc scores.

2021-08-12

Hannah Rose Kirk, Bertram Vidgen, Paul Röttger, Tristan Thrush, Scott A. Hale
Abstract: detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. emoji-based hate is an emerging challenge for automated detection. we present hatemojicheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. using the test suite, we expose weaknesses in existing hate detection models. to address these weaknesses, we create the hatemojibuild dataset using a human-and-model-in-the-loop approach. models built with these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. both hatemojicheck and hatemojibuild are made publicly available. see our github repository (https://github.com/hannahkirk/hatemoji). hatemojicheck, hatemojibuild, and the final hatemoji model are also available on huggingface (https://huggingface.co/datasets/hannahrosekirk/).

2021-08-11

Jiahao Chen, Victor Storchan, Eren Kurshan
Abstract: we review practical challenges in building and deploying ethical ai at the scale of contemporary industrial and societal uses. apart from the purely technical concerns that are the usual focus of academic research, the operational challenges of inconsistent regulatory pressures, conflicting business goals, data quality issues, development processes, systems integration practices, and the scale of deployment all conspire to create new ethical risks. such ethical concerns arising from these practical considerations are not adequately addressed by existing research results. we argue that a holistic consideration of ethics in the development and deployment of ai systems is necessary for building ethical ai in practice, and exhort researchers to consider the full operational contexts of ai systems when assessing ethical risks.

2021-08-10

Arnav Kartikeya
Abstract: trust between humans and artificial intelligence(ai) is an issue which has implications in many fields of human computer interaction. the current issue with artificial intelligence is a lack of transparency into its decision making, and literature shows that increasing transparency increases trust. explainable artificial intelligence has the ability to increase transparency of ai, which could potentially increase trust for humans. this paper attempts to use the task of predicting yelp review star ratings with assistance from an explainable and non explainable artificial intelligence to see if trust is increased with increased transparency. results show that for these tasks, explainable artificial intelligence provided significant increase in trust as a measure of influence.

2021-08-07

Sunipa Dev, Emily Sheng, Jieyu Zhao, Aubrie Amstutz, Jiao Sun, Yu Hou, Mattie Sanseverino, Jiin Kim, Akihiro Nishi, Nanyun Peng, Kai-Wei Chang
Abstract: recent studies show that natural language processing (nlp) technologies propagate societal biases about demographic groups associated with attributes such as gender, race, and nationality. to create interventions and mitigate these biases and associated harms, it is vital to be able to detect and measure such biases. while existing works propose bias evaluation and mitigation methods for various tasks, there remains a need to cohesively understand the biases and the specific harms they measure, and how different measures compare with each other. to address this gap, this work presents a practical framework of harms and a series of questions that practitioners can answer to guide the development of bias measures. as a validation of our framework and documentation questions, we also present several case studies of how existing bias measures in nlp -- both intrinsic measures of bias in representations and extrinsic measures of bias of downstream applications -- can be aligned with different harms and how our proposed documentation questions facilitates more holistic understanding of what bias measures are measuring.

2021-08-05

Markus Langer, Kevin Baum, Kathrin Hartmann, Stefan Hessel, Timo Speith, Jonas Wahl
Abstract: national and international guidelines for trustworthy artificial intelligence (ai) consider explainability to be a central facet of trustworthy systems. this paper outlines a multi-disciplinary rationale for explainability auditing. specifically, we propose that explainability auditing can ensure the quality of explainability of systems in applied contexts and can be the basis for certification as a means to communicate whether systems meet certain explainability standards and requirements. moreover, we emphasize that explainability auditing needs to take a multi-disciplinary perspective, and we provide an overview of four perspectives (technical, psychological, ethical, legal) and their respective benefits with respect to explainability auditing.

2021-08-04

Helen Ngo, Cooper Raterink, João G. M. Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, Nicholas Frosst
Abstract: language models trained on large-scale unfiltered datasets curated from the open web acquire systemic biases, prejudices, and harmful views from their training data. we present a methodology for programmatically identifying and removing harmful text from web-scale datasets. a pretrained language model is used to calculate the log-likelihood of researcher-written trigger phrases conditioned on a specific document, which is used to identify and filter documents from the dataset. we demonstrate that models trained on this filtered dataset exhibit lower propensity to generate harmful text, with a marginal decrease in performance on standard language modeling benchmarks compared to unfiltered baselines. we provide a partial explanation for this performance gap by surfacing examples of hate speech and other undesirable content from standard language modeling benchmarks. finally, we discuss the generalization of this method and how trigger phrases which reflect specific values can be used by researchers to build language models which are more closely aligned with their values.

2021-08-03

Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari, Xiang Ren, Morteza Dehghani
Abstract: bias mitigation approaches reduce models' dependence on sensitive features of data, such as social group tokens (sgts), resulting in equal predictions across the sensitive features. in hate speech detection, however, equalizing model predictions may ignore important differences among targeted social groups, as hate speech can contain stereotypical language specific to each sgt. here, to take the specific language about each sgt into account, we rely on counterfactual fairness and equalize predictions among counterfactuals, generated by changing the sgts. our method evaluates the similarity in sentence likelihoods (via pre-trained language models) among counterfactuals, to treat sgts equally only within interchangeable contexts. by applying logit pairing to equalize outcomes on the restricted set of counterfactuals for each instance, we improve fairness metrics while preserving model performance on hate speech detection.
Cécile Logé, Emily Ross, David Yaw Amoah Dadey, Saahil Jain, Adriel Saporta, Andrew Y. Ng, Pranav Rajpurkar
Abstract: recent advances in natural language processing (nlp), and specifically automated question answering (qa) systems, have demonstrated both impressive linguistic fluency and a pernicious tendency to reflect social biases. in this study, we introduce q-pain, a dataset for assessing bias in medical qa in the context of pain management, one of the most challenging forms of clinical decision-making. along with the dataset, we propose a new, rigorous framework, including a sample experimental design, to measure the potential biases present when making treatment decisions. we demonstrate its use by assessing two reference question-answering systems, gpt-2 and gpt-3, and find statistically significant differences in treatment between intersectional race-gender subgroups, thus reaffirming the risks posed by ai in medical settings, and the need for datasets like ours to ensure safety before medical ai applications are deployed.

2021-08-02

Amit Sheth, Manas Gaur, Kaushik Roy, Keyur Faldu
Abstract: ai systems have seen significant adoption in various domains. at the same time, further adoption in some domains is hindered by inability to fully trust an ai system that it will not harm a human. besides the concerns for fairness, privacy, transparency, and explainability are key to developing trusts in ai systems. as stated in describing trustworthy ai "trust comes through understanding. how ai-led decisions are made and what determining factors were included are crucial to understand." the subarea of explaining ai systems has come to be known as xai. multiple aspects of an ai system can be explained; these include biases that the data might have, lack of data points in a particular region of the example space, fairness of gathering the data, feature importances, etc. however, besides these, it is critical to have human-centered explanations that are directly related to decision-making similar to how a domain expert makes decisions based on "domain knowledge," that also include well-established, peer-validated explicit guidelines. to understand and validate an ai system's outcomes (such as classification, recommendations, predictions), that lead to developing trust in the ai system, it is necessary to involve explicit domain knowledge that humans understand and use.
Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, Mikhail Yurochkin, Moninder Singh
Abstract: the popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in down-stream tasks, which have a higher potential for societal impact. the evaluation of such systems usually focuses on accuracy measures. our findings in this paper call for attention to be paid to fairness measures as well. through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks (english), we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. at the same time, we find that little of the fairness variation is explained by model size, despite claims in the literature. to improve model fairness without retraining, we show that two post-processing methods developed for structured, tabular data can be successfully applied to a range of pretrained language models. warning: this paper contains samples of offensive text.
Jose N. Paredes, Juan Carlos L. Teze, Gerardo I. Simari, Maria Vanina Martinez
Abstract: with the availability of large datasets and ever-increasing computing power, there has been a growing use of data-driven artificial intelligence systems, which have shown their potential for successful application in diverse areas. however, many of these systems are not able to provide information about the rationale behind their decisions to their users. lack of understanding of such decisions can be a major drawback, especially in critical domains such as those related to cybersecurity. in light of this problem, in this paper we make three contributions: (i) proposal and discussion of desiderata for the explanation of outputs generated by ai-based cybersecurity systems; (ii) a comparative analysis of approaches in the literature on explainable artificial intelligence (xai) under the lens of both our desiderata and further dimensions that are typically used for examining xai approaches; and (iii) a general architecture that can serve as a roadmap for guiding research efforts towards the development of explainable ai-based cybersecurity systems -- at its core, this roadmap proposes combinations of several research lines in a novel way towards tackling the unique challenges that arise in this context.

2021-07-28

William Paul, Philippe Burlina
Abstract: assured ai in unrestricted settings is a critical problem. our framework addresses ai assurance challenges lying at the intersection of domain adaptation, fairness, and counterfactuals analysis, operating via the discovery and intervention on factors of variations in data (e.g. weather or illumination conditions) that significantly affect the robustness of ai models. robustness is understood here as insensitivity of the model performance to variations in sensitive factors. sensitive factors are traditionally set in a supervised setting, whereby factors are known a-priori (e.g. for fairness this could be factors like sex or race). in contrast, our motivation is real-life scenarios where less, or nothing, is actually known a-priori about certain factors that cause models to fail. this leads us to consider various settings (unsupervised, domain generalization, semi-supervised) that correspond to different degrees of incomplete knowledge about those factors. therefore, our two step approach works by a) discovering sensitive factors that cause ai systems to fail in a unsupervised fashion, and then b) intervening models to lessen these factor's influence. our method considers 3 interventions consisting of augmentation, coherence, and adversarial interventions (acai). we demonstrate the ability for interventions on discovered/source factors to generalize to target/real factors. we also demonstrate how adaptation to real factors of variations can be performed in the semi-supervised case where some target factor labels are known, via automated intervention selection. experiments show that our approach improves on baseline models, with regard to achieving optimal utility vs. sensitivity/robustness tradeoffs.

2021-07-26

Florian Tambon, Gabriel Laberge, Le An, Amin Nikanjam, Paulina Stevia Nouwou Mindom, Yann Pequignot, Foutse Khomh, Giulio Antoniol, Ettore Merlo, François Laviolette
Abstract: context: machine learning (ml) has been at the heart of many innovations over the past years. however, including it in so-called 'safety-critical' systems such as automotive or aeronautic has proven to be very challenging, since the shift in paradigm that ml brings completely changes traditional certification approaches. objective: this paper aims to elucidate challenges related to the certification of ml-based safety-critical systems, as well as the solutions that are proposed in the literature to tackle them, answering the question 'how to certify machine learning based safety-critical systems?'. method: we conduct a systematic literature review (slr) of research papers published between 2015 to 2020, covering topics related to the certification of ml systems. in total, we identified 217 papers covering topics considered to be the main pillars of ml certification: robustness, uncertainty, explainability, verification, safe reinforcement learning, and direct certification. we analyzed the main trends and problems of each sub-field and provided summaries of the papers extracted. results: the slr results highlighted the enthusiasm of the community for this subject, as well as the lack of diversity in terms of datasets and type of models. it also emphasized the need to further develop connections between academia and industries to deepen the domain study. finally, it also illustrated the necessity to build connections between the above mention main pillars that are for now mainly studied separately. conclusion: we highlighted current efforts deployed to enable the certification of ml based software systems, and discuss some future research directions.

2021-07-25

Pedro H. C. Avelar, Rafael B. Audibert, Anderson R. Tavares, Luís C. Lamb
Abstract: recently, the use of sound measures and metrics in artificial intelligence has become the subject of interest of academia, government, and industry. efforts towards measuring different phenomena have gained traction in the ai community, as illustrated by the publication of several influential field reports and policy documents. these metrics are designed to help decision takers to inform themselves about the fast-moving and impacting influences of key advances in artificial intelligence in general and machine learning in particular. in this paper we propose to use such newfound capabilities of ai technologies to augment our ai measuring capabilities. we do so by training a model to classify publications related to ethical issues and concerns. in our methodology we use an expert, manually curated dataset as the training set and then evaluate a large set of research papers. finally, we highlight the implications of ai metrics, in particular their contribution towards developing trustful and fair ai-based tools and technologies. keywords: ai ethics; ai fairness; ai measurement. ethics in computer science.

2021-07-22

Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, Dylan Hadfield-Menell
Abstract: we describe cases where real recommender systems were modified in the service of various human values such as diversity, fairness, well-being, time well spent, and factual accuracy. from this we identify the current practice of values engineering: the creation of classifiers from human-created data with value-based labels. this has worked in practice for a variety of issues, but problems are addressed one at a time, and users and other stakeholders have seldom been involved. instead, we look to ai alignment work for approaches that could learn complex values directly from stakeholders, and identify four major directions: useful measures of alignment, participatory design and operation, interactive value learning, and informed deliberative judgments.

2021-07-20

Eike Petersen, Yannik Potdevin, Esfandiar Mohammadi, Stephan Zidowitz, Sabrina Breyer, Dirk Nowotka, Sandra Henn, Ludwig Pechmann, Martin Leucker, Philipp Rostalski, Christian Herzog
Abstract: machine learning is expected to fuel significant improvements in medical care. to ensure that fundamental principles such as beneficence, respect for human autonomy, prevention of harm, justice, privacy, and transparency are respected, medical machine learning systems must be developed responsibly. many high-level declarations of ethical principles have been put forth for this purpose, but there is a severe lack of technical guidelines explicating the practical consequences for medical machine learning. similarly, there is currently considerable uncertainty regarding the exact regulatory requirements placed upon medical machine learning systems. this survey provides an overview of the technical and procedural challenges involved in creating medical machine learning systems responsibly and in conformity with existing regulations, as well as possible solutions to address these challenges. first, a brief review of existing regulations affecting medical machine learning is provided, showing that properties such as safety, robustness, reliability, privacy, security, transparency, explainability, and nondiscrimination are all demanded already by existing law and regulations - albeit, in many cases, to an uncertain degree. next, the key technical obstacles to achieving these desirable properties are discussed, as well as important techniques to overcome these obstacles in the medical context. we notice that distribution shift, spurious correlations, model underspecification, uncertainty quantification, and data scarcity represent severe challenges in the medical context. promising solution approaches include the use of large and representative datasets and federated learning as a means to that end, the careful exploitation of domain knowledge, the use of inherently transparent models, comprehensive out-of-distribution model testing and verification, as well as algorithmic impact assessments.

2021-07-19

Margherita Fanton, Helena Bonaldi, Serra Sinem Tekiroglu, Marco Guerini
Abstract: undermining the impact of hateful content with informed and non-aggressive responses, called counter narratives, has emerged as a possible solution for having healthier online communities. thus, some nlp studies have started addressing the task of counter narrative generation. although such studies have made an effort to build hate speech / counter narrative (hs/cn) datasets for neural generation, they fall short in reaching either high-quality and/or high-quantity. in this paper, we propose a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively by using its own data from the previous loops to generate new training samples that experts review and/or post-edit. our experiments comprised several loops including dynamic variations. results show that the methodology is scalable and facilitates diverse, novel, and cost-effective data collection. to our knowledge, the resulting dataset is the only expert-based multi-target hs/cn dataset available to the community.
Angie Boggust, Benjamin Hoover, Arvind Satyanarayan, Hendrik Strobelt
Abstract: saliency methods -- techniques to identify the importance of input features on a model's output -- are a common step in understanding neural network behavior. however, interpreting saliency requires tedious manual inspection to identify and aggregate patterns in model behavior, resulting in ad hoc or cherry-picked analysis. to address these concerns, we present shared interest: metrics for comparing model reasoning (via saliency) to human reasoning (via ground truth annotations). by providing quantitative descriptors, shared interest enables ranking, sorting, and aggregating inputs, thereby facilitating large-scale systematic analysis of model behavior. we use shared interest to identify eight recurring patterns in model behavior, such as cases where contextual features or a subset of ground truth features are most important to the model. working with representative real-world users, we show how shared interest can be used to decide if a model is trustworthy, uncover issues missed in manual analyses, and enable interactive probing.
Amitoj Singh, Jingshu Chen, Lihao Zhang, Amin Rasekh, Ilana Golbin, Anand Rao
Abstract: an independent ethical assessment of an artificial intelligence system is an impartial examination of the system's development, deployment, and use in alignment with ethical values. system-level qualitative frameworks that describe high-level requirements and component-level quantitative metrics that measure individual ethical dimensions have been developed over the past few years. however, there exists a gap between the two, which hinders the execution of independent ethical assessments in practice. this study bridges this gap and designs a holistic independent ethical assessment process for a text classification model with a special focus on the task of hate speech detection. the assessment is further augmented with protected attributes mining and counterfactual-based analysis to enhance bias assessment. it covers assessments of technical performance, data bias, embedding bias, classification bias, and interpretability. the proposed process is demonstrated through an assessment of a deep hate speech detection model.

2021-07-15

Liam Magee, Lida Ghahremanlou, Karen Soldatic, Shanthi Robertson
Abstract: to examine whether intersectional bias can be observed in language generation, we examine \emph{gpt-2} and \emph{gpt-neo} models, ranging in size from 124 million to ~2.7 billion parameters. we conduct an experiment combining up to three social categories - gender, religion and disability - into unconditional or zero-shot prompts used to generate sentences that are then analysed for sentiment. our results confirm earlier tests conducted with auto-regressive causal models, including the \emph{gpt} family of models. we also illustrate why bias may be resistant to techniques that target single categories (e.g. gender, religion and race), as it can also manifest, in often subtle ways, in texts prompted by concatenated social categories. to address these difficulties, we suggest technical and community-based approaches need to combine to acknowledge and address complex and intersectional language model bias.

2021-07-14

Ramya Akula, Ivan Garibay
Abstract: algorithms are becoming more widely used in business, and businesses are becoming increasingly concerned that their algorithms will cause significant reputational or financial damage. we should emphasize that any of these damages stem from situations in which the united states lacks strict legislative prohibitions or specified protocols for measuring damages. as a result, governments are enacting legislation and enforcing prohibitions, regulators are fining businesses, and the judiciary is debating whether or not to make artificially intelligent computer models as the decision-makers in the eyes of the law. from autonomous vehicles and banking to medical care, housing, and legal decisions, there will soon be enormous amounts of algorithms that make decisions with limited human interference. governments, businesses, and society would have an algorithm audit, which would have systematic verification that algorithms are lawful, ethical, and secure, similar to financial audits. a modern market, auditing, and assurance of algorithms developed to professionalize and industrialize ai, machine learning, and related algorithms. stakeholders of this emerging field include policymakers and regulators, along with industry experts and entrepreneurs. in addition, we foresee audit thresholds and frameworks providing valuable information to all who are concerned with governance and standardization. this paper aims to review the critical areas required for auditing and assurance and spark discussion in this novel field of study and practice.

2021-07-13

Moniba Keymanesh, Tanya Berger-Wolf, Micha Elsner, Srinivasan Parthasarathy
Abstract: in consequential domains such as recidivism prediction, facility inspection, and benefit assignment, it's important for individuals to know the decision-relevant information for the model's prediction. in addition, predictions should be fair both in terms of the outcome and the justification of the outcome. in other words, decision-relevant features should provide sufficient information for the predicted outcome and should be independent of the membership of individuals in protected groups such as race and gender. in this work, we focus on the problem of (un)fairness in the justification of the text-based neural models. we tie the explanatory power of the model to fairness in the outcome and propose a fairness-aware summarization mechanism to detect and counteract the bias in such models. given a potentially biased natural language explanation for a decision, we use a multi-task neural model and an attribution mechanism based on integrated gradients to extract high-utility and low-bias justifications in form of a summary. the extracted summary is then used for training a model to make decisions for individuals. results on several real world datasets suggest that our method drastically limits the demographic leakage in the input (fairness in justification) while moderately enhancing the fairness in the outcome. our model is also effective in detecting and counteracting several types of data poisoning attacks that synthesize race-coded reasoning or irrelevant justifications.

2021-07-12

Reuben Binns, Reuben Kirkham
Abstract: this article examines the concept of 'ai fairness' for people with disabilities from the perspective of data protection and equality law. this examination demonstrates that there is a need for a distinctive approach to ai fairness that is fundamentally different to that used for other protected characteristics, due to the different ways in which discrimination and data protection law applies in respect of disability. we articulate this new agenda for ai fairness for people with disabilities, explaining how combining data protection and equality law creates new opportunities for disabled people's organisations and assistive technology researchers alike to shape the use of ai, as well as to challenge potential harmful uses.
Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Liu, Anil K. Jain, Jiliang Tang
Abstract: in the past few decades, artificial intelligence (ai) technology has experienced swift developments, changing everyone's daily life and profoundly altering the course of human society. the intention of developing ai is to benefit humans, by reducing human labor, bringing everyday convenience to human lives, and promoting social good. however, recent research and ai applications show that ai can cause unintentional harm to humans, such as making unreliable decisions in safety-critical scenarios or undermining fairness by inadvertently discriminating against one group. thus, trustworthy ai has attracted immense attention recently, which requires careful consideration to avoid the adverse effects that ai may bring to humans, so that humans can fully trust and live in harmony with ai technologies. recent years have witnessed a tremendous amount of research on trustworthy ai. in this survey, we present a comprehensive survey of trustworthy ai from a computational perspective, to help readers understand the latest technologies for achieving trustworthy ai. trustworthy ai is a large and complex area, involving various dimensions. in this work, we focus on six of the most crucial dimensions in achieving trustworthy ai: (i) safety & robustness, (ii) non-discrimination & fairness, (iii) explainability, (iv) privacy, (v) accountability & auditability, and (vi) environmental well-being. for each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems. we also discuss the accordant and conflicting interactions among different dimensions and discuss potential aspects for trustworthy ai to investigate in the future.

2021-07-11

Susan Von Struensee
Abstract: there is mounting public concern over the influence that ai based systems has in our society. coalitions in all sectors are acting worldwide to resist hamful applications of ai. from indigenous people addressing the lack of reliable data, to smart city stakeholders, to students protesting the academic relationships with sex trafficker and mit donor jeffery epstein, the questionable ethics and values of those heavily investing in and profiting from ai are under global scrutiny. there are biased, wrongful, and disturbing assumptions embedded in ai algorithms that could get locked in without intervention. our best human judgment is needed to contain ai's harmful impact. perhaps one of the greatest contributions of ai will be to make us ultimately understand how important human wisdom truly is in life on earth.

2021-07-07

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob Mcgrew, Dario Amodei, Sam Mccandlish, Ilya Sutskever, Wojciech Zaremba
Abstract: we introduce codex, a gpt language model fine-tuned on publicly available code from github, and study its python code-writing capabilities. a distinct production version of codex powers github copilot. on humaneval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while gpt-3 solves 0% and gpt-j solves 11.4%. furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. using this method, we solve 70.2% of our problems with 100 samples per problem. careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
Emily Dinan, Gavin Abercrombie, A. Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau, Verena Rieser
Abstract: over the last several years, end-to-end neural conversational agents have vastly improved in their ability to carry a chit-chat conversation with humans. however, these models are often trained on large datasets from the internet, and as a result, may learn undesirable behaviors from this data, such as toxic or otherwise harmful language. researchers must thus wrestle with the issue of how and when to release these models. in this paper, we survey the problem landscape for safety for end-to-end conversational ai and discuss recent and related work. we highlight tensions between values, potential positive impact and potential harms, and provide a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design. we additionally provide a suite of tools to enable researchers to make better-informed decisions about training and releasing end-to-end conversational ai models.

2021-07-06

Olivia Brown, Andrew Curtis, Justin Goodwin
Abstract: the department of defense (dod) has significantly increased its investment in the design, evaluation, and deployment of artificial intelligence and machine learning (ai/ml) capabilities to address national security needs. while there are numerous ai/ml successes in the academic and commercial sectors, many of these systems have also been shown to be brittle and nonrobust. in a complex and ever-changing national security environment, it is vital that the dod establish a sound and methodical process to evaluate the performance and robustness of ai/ml models before these new capabilities are deployed to the field. this paper reviews the ai/ml development process, highlights common best practices for ai/ml model evaluation, and makes recommendations to dod evaluators to ensure the deployment of robust ai/ml capabilities for national security needs.

2021-07-05

Ron Bitton, Nadav Maman, Inderjeet Singh, Satoru Momiyama, Yuval Elovici, Asaf Shabtai
Abstract: although cyberattacks on machine learning (ml) production systems can be harmful, today, security practitioners are ill equipped, lacking methodologies and tactical tools that would allow them to analyze the security risks of their ml-based systems. in this paper, we performed a comprehensive threat analysis of ml production systems. in this analysis, we follow the ontology presented by nist for evaluating enterprise network security risk and apply it to ml-based production systems. specifically, we (1) enumerate the assets of a typical ml production system, (2) describe the threat model (i.e., potential adversaries, their capabilities, and their main goal), (3) identify the various threats to ml systems, and (4) review a large number of attacks, demonstrated in previous studies, which can realize these threats. in addition, to quantify the risk of adversarial machine learning (aml) threat, we introduce a novel scoring system, which assign a severity score to different aml attacks. the proposed scoring system utilizes the analytic hierarchy process (ahp) for ranking, with the assistance of security experts, various attributes of the attacks. finally, we developed an extension to the mulval attack graph generation and analysis framework to incorporate cyberattacks on ml production systems. using the extension, security practitioners can apply attack graph analysis methods in environments that include ml components; thus, providing security practitioners with a methodological and practical tool for evaluating the impact and quantifying the risk of a cyberattack targeting an ml production system.

2021-06-30

Sebastian Krügel, Andreas Ostermaier, Matthias Uhl
Abstract: departing from the claim that ai needs to be trustworthy, we find that ethical advice from an ai-powered algorithm is trusted even when its users know nothing about its training data and when they learn information about it that warrants distrust. we conducted online experiments where the subjects took the role of decision-makers who received advice from an algorithm on how to deal with an ethical dilemma. we manipulated the information about the algorithm and studied its influence. our findings suggest that ai is overtrusted rather than distrusted. we suggest digital literacy as a potential remedy to ensure the responsible use of ai.

2021-06-29

Yisroel Mirsky, Ambra Demontis, Jaidip Kotak, Ram Shankar, Deng Gelei, Liu Yang, Xiangyu Zhang, Wenke Lee, Yuval Elovici, Battista Biggio
Abstract: ai has provided us with the ability to automate tasks, extract information from vast amounts of data, and synthesize media that is nearly indistinguishable from the real thing. however, positive tools can also be used for negative purposes. in particular, cyber adversaries can use ai (such as machine learning) to enhance their attacks and expand their campaigns. although offensive ai has been discussed in the past, there is a need to analyze and understand the threat in the context of organizations. for example, how does an ai-capable adversary impact the cyber kill chain? does ai benefit the attacker more than the defender? what are the most significant ai threats facing organizations today and what will be their impact on the future? in this survey, we explore the threat of offensive ai on organizations. first, we present the background and discuss how ai changes the adversary's methods, strategies, goals, and overall attack model. then, through a literature review, we identify 33 offensive ai capabilities which adversaries can use to enhance their attacks. finally, through a user study spanning industry and academia, we rank the ai threats and provide insights on the adversaries.

2021-06-25

Rajiv Movva
Abstract: early studies of risk assessment algorithms used in criminal justice revealed widespread racial biases. in response, machine learning researchers have developed methods for fairness, many of which rely on equalizing empirical metrics across protected attributes. here, i recall sociotechnical perspectives to delineate the significant gap between fairness in theory and practice, focusing on criminal justice. i (1) illustrate how social context can undermine analyses that are restricted to an ai system's outputs, and (2) argue that much of the fair ml literature fails to account for epistemological issues with underlying crime data. instead of building ai that reifies power imbalances, like risk assessment algorithms, i ask whether data science can be used to understand the root causes of structural marginalization.

2021-06-24

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, Ruslan Salakhutdinov
Abstract: as machine learning methods are deployed in real-world settings such as healthcare, legal systems, and social science, it is crucial to recognize how they shape social biases and stereotypes in these sensitive decision-making processes. among such real-world deployments are large-scale pretrained language models (lms) that can be potentially dangerous in manifesting undesirable representational biases - harmful biases resulting from stereotyping that propagate negative generalizations involving gender, race, religion, and other social constructs. as a step towards improving the fairness of lms, we carefully define several sources of representational biases before proposing new benchmarks and metrics to measure them. with these tools, we propose steps towards mitigating social biases during text generation. our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness pareto frontier.

2021-06-22

Yi-Ling Chung, Serra Sinem Tekiroglu, Marco Guerini
Abstract: tackling online hatred using informed textual responses - called counter narratives - has been brought under the spotlight recently. accordingly, a research line has emerged to automatically generate counter narratives in order to facilitate the direct intervention in the hate discussion and to prevent hate content from further spreading. still, current neural approaches tend to produce generic/repetitive responses and lack grounded and up-to-date evidence such as facts, statistics, or examples. moreover, these models can create plausible but not necessarily true arguments. in this paper we present the first complete knowledge-bound counter narrative generation pipeline, grounded in an external knowledge repository that can provide more informative content to fight online hatred. together with our approach, we present a series of experiments that show its feasibility to produce suitable and informative counter narratives in in-domain and cross-domain settings.

2021-06-21

Anjalie Field, Su Lin Blodgett, Zeerak Waseem, Yulia Tsvetkov
Abstract: despite inextricable ties between race and language, little work has considered race in nlp research and development. in this work, we survey 79 papers from the acl anthology that mention race. these papers reveal various types of race-related bias in all stages of nlp model development, highlighting the need for proactive consideration of how nlp systems can uphold racial hierarchies. however, persistent gaps in research on race and nlp remain: race has been siloed as a niche topic and remains ignored in many nlp tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in nlp literature. by identifying where and how nlp literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in nlp research practices.
Michael S. Bernstein, Margaret Levi, David Magnus, Betsy Rajala, Debra Satz, Charla Waeiss
Abstract: artificial intelligence (ai) research is routinely criticized for its real and potential impacts on society, and we lack adequate institutional responses to this criticism and to the responsibility that it reflects. ai research often falls outside the purview of existing feedback mechanisms such as the institutional review board (irb), which are designed to evaluate harms to human subjects rather than harms to human society. in response, we have developed the ethics and society review board (esr), a feedback panel that works with researchers to mitigate negative ethical and societal aspects of ai research. the esr's main insight is to serve as a requirement for funding: researchers cannot receive grant funding from a major ai funding program at our university until the researchers complete the esr process for the proposal. in this article, we describe the esr as we have designed and run it over its first year across 41 proposals. we analyze aggregate esr feedback on these proposals, finding that the panel most commonly identifies issues of harms to minority groups, inclusion of diverse stakeholders in the research plan, dual use, and representation in data. surveys and interviews of researchers who interacted with the esr found that 58% felt that it had influenced the design of their research project, 100% are willing to continue submitting future projects to the esr, and that they sought additional scaffolding for reasoning through ethics and society issues.

2021-06-20

Yada Pruksachatkun, Satyapriya Krishna, Jwala Dhamala, Rahul Gupta, Kai-Wei Chang
Abstract: existing bias mitigation methods to reduce disparities in model outcomes across cohorts have focused on data augmentation, debiasing model embeddings, or adding fairness-based optimization objectives during training. separately, certified word substitution robustness methods have been developed to decrease the impact of spurious features and synonym substitutions on model predictions. while their end goals are different, they both aim to encourage models to make the same prediction for certain changes in the input. in this paper, we investigate the utility of certified word substitution robustness methods to improve equality of odds and equality of opportunity on multiple text classification tasks. we observe that certified robustness methods improve fairness, and using both robustness and bias mitigation methods in training results in an improvement in both fronts

2021-06-18

Irene Solaiman, Christy Dennison
Abstract: language models can generate harmful and biased outputs and exhibit undesirable behavior according to a given cultural context. we propose a process for adapting language models to society (palms) with values-targeted datasets, an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values. we evaluate our process using three metrics: quantitative metrics with human evaluations that score output adherence to a target value, toxicity scoring on outputs; and qualitative metrics analyzing the most common word associated with a given social category. through each iteration, we add additional training dataset examples based on observed shortcomings from evaluations. palms performs significantly better on all metrics compared to baseline and control models for a broad range of gpt-3 language model sizes without compromising capability integrity. we find that the effectiveness of palms increases with model size. we show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.

2021-06-17

Rajitha Ramanayake, Philipp Wicke, Vivek Nallur
Abstract: we are moving towards a future where artificial intelligence (ai) based agents make many decisions on behalf of humans. from healthcare decision making to social media censoring, these agents face problems, and make decisions with ethical and societal implications. ethical behaviour is a critical characteristic that we would like in a human-centric ai. a common observation in human-centric industries, like the service industry and healthcare, is that their professionals tend to break rules, if necessary, for pro-social reasons. this behaviour among humans is defined as pro-social rule breaking. to make ai agents more human centric, we argue that there is a need for a mechanism that helps ai agents identify when to break rules set by their designers. to understand when ai agents need to break rules, we examine the conditions under which humans break rules for pro-social reasons. in this paper, we present a study that introduces a 'vaccination strategy dilemma' to human participants and analyses their responses. in this dilemma, one needs to decide whether they would distribute covid-19 vaccines only to members of a high-risk group (follow the enforced rule) or, in selected cases, administer the vaccine to a few social influencers (break the rule), which might yield an overall greater benefit to society. the results of the empirical study suggest a relationship between stakeholder utilities and pro-social rule breaking (psrb), which neither deontological nor utilitarian ethics completely explain. finally, the paper discusses the design characteristics of an ethical agent capable of psrb and the future research directions on psrb in the ai realm. we hope that this will inform the design of future ai agents, and their decision-making behaviour.

2021-06-16

Gauri Gupta, Krithika Ramesh, Sanjay Singh
Abstract: with language models being deployed increasingly in the real world, it is essential to address the issue of the fairness of their outputs. the word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. the nature of gendered languages like hindi, poses an additional problem to the quantification and mitigation of bias, owing to the change in the form of the words in the sentence, based on the gender of the subject. additionally, there is sparse work done in the realm of measuring and debiasing systems for indic languages. in our work, we attempt to evaluate and quantify the gender bias within a hindi-english machine translation system. we implement a modified version of the existing tgbi metric based on the grammatical considerations for hindi. we also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.

2021-06-14

Yung-Sung Chuang, Mingye Gao, Hongyin Luo, James Glass, Hung-Yi Lee, Yun-Nung Chen, Shang-Wen Li
Abstract: automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. however, biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. the biases make the learned models unfair and can even exacerbate the marginalization of people. considering that current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in the toxicity detectors, we propose to use invariant rationalization (invrat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns (e.g., identity mentions, dialect) to toxicity labels. we empirically show that our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
Kiana Alikhademi, Brianna Richardson, Emma Drobina, Juan E. Gilbert
Abstract: many ml models are opaque to humans, producing decisions too complex for humans to easily understand. in response, explainable artificial intelligence (xai) tools that analyze the inner workings of a model have been created. despite these tools' strength in translating model behavior, critiques have raised concerns about the impact of xai tools as a tool for `fairwashing` by misleading users into trusting biased or incorrect models. in this paper, we created a framework for evaluating explainable ai tools with respect to their capabilities for detecting and addressing issues of bias and fairness as well as their capacity to communicate these results to their users clearly. we found that despite their capabilities in simplifying and explaining model behavior, many prominent xai tools lack features that could be critical in detecting bias. developers can use our framework to suggest modifications needed in their toolkits to reduce issues likes fairwashing.

2021-06-11

Yejin Bang, Nayeon Lee, Etsuko Ishii, Andrea Madotto, Pascale Fung
Abstract: politically sensitive topics are still a challenge for open-domain chatbots. however, dealing with politically sensitive content in a responsible, non-partisan, and safe behavior way is integral for these chatbots. currently, the main approach to handling political sensitivity is by simply changing such a topic when it is detected. this is safe but evasive and results in a chatbot that is less engaging. in this work, as a first step towards a politically safe chatbot, we propose a group of metrics for assessing their political prudence. we then conduct political prudence analysis of various chatbots and discuss their behavior from multiple angles through our automatic metric and human evaluation metrics. the testsets and codebase are released to promote research in this area.

2021-06-10

Roel Dobbe, Thomas Krendl Gilbert, Yonatan Mintz
Abstract: as ai systems are integrated into high stakes social domains, researchers now examine how to design and operate them in a safe and ethical manner. however, the criteria for identifying and diagnosing safety risks in complex social contexts remain unclear and contested. in this paper, we examine the vagueness in debates about the safety and ethical behavior of ai systems. we show how this vagueness cannot be resolved through mathematical formalism alone, instead requiring deliberation about the politics of development as well as the context of deployment. drawing from a new sociotechnical lexicon, we redefine vagueness in terms of distinct design challenges at key stages in ai system development. the resulting framework of hard choices in artificial intelligence (hcai) empowers developers by 1) identifying points of overlap between design decisions and major sociotechnical challenges; 2) motivating the creation of stakeholder feedback channels so that safety issues can be exhaustively addressed. as such, hcai contributes to a timely debate about the status of ai development in democratic societies, arguing that deliberation should be the goal of ai safety, not just the procedure by which it is ensured.

2021-06-09

Sina Mohseni, Haotao Wang, Zhiding Yu, Chaowei Xiao, Zhangyang Wang, Jay Yadawa
Abstract: the open-world deployment of machine learning (ml) algorithms in safety-critical applications such as autonomous vehicles needs to address a variety of ml vulnerabilities such as interpretability, verifiability, and performance limitations. research explores different approaches to improve ml dependability by proposing new models and training techniques to reduce generalization error, achieve domain adaptation, and detect outlier examples and adversarial attacks. however, there is a missing connection between ongoing ml research and well-established safety principles. in this paper, we present a structured and comprehensive review of ml techniques to improve the dependability of ml algorithms in uncontrolled open-world settings. from this review, we propose the taxonomy of ml safety that maps state-of-the-art ml techniques to key engineering safety strategies. our taxonomy of ml safety presents a safety-oriented categorization of ml techniques to provide guidance for improving dependability of the ml design and development. the proposed taxonomy can serve as a safety checklist to aid designers in improving coverage and diversity of safety strategies employed in any given ml system.

2021-06-07

Soumya Barikeri, Anne Lauscher, Ivan Vulić, Goran Glavaš
Abstract: text representation models are prone to exhibit a range of societal biases, reflecting the non-controlled and biased nature of the underlying pretraining data, which consequently leads to severe ethical issues and even bias amplification. recent work has predominantly focused on measuring and mitigating bias in pretrained language models. surprisingly, the landscape of bias measurements and mitigation resources and methods for conversational language models is still very scarce: it is limited to only a few types of bias, artificially constructed resources, and completely ignores the impact that debiasing methods may have on the final performance in dialog tasks, e.g., conversational response generation. in this work, we present redditbias, the first conversational data set grounded in the actual human conversations from reddit, allowing for bias measurement and mitigation across four important bias dimensions: gender, race, religion, and queerness. further, we develop an evaluation framework which simultaneously 1) measures bias on the developed redditbias resource, and 2) evaluates model capability in dialog tasks after model debiasing. we use the evaluation framework to benchmark the widely used conversational dialogpt model along with the adaptations of four debiasing methods. our results indicate that dialogpt is biased with respect to religious groups and that some debiasing techniques can remove this bias while preserving downstream task performance.
Xin Guo, Jianlei Yang, Haoyi Zhou, Xucheng Ye, Jianxin Li
Abstract: pre-trained language models achieve outstanding performance in nlp tasks. various knowledge distillation methods have been proposed to reduce the heavy computation and storage requirements of pre-trained language models. however, from our observations, student models acquired by knowledge distillation suffer from adversarial attacks, which limits their usage in security sensitive scenarios. in order to overcome these security problems, rosearch is proposed as a comprehensive framework to search the student models with better adversarial robustness when performing knowledge distillation. a directed acyclic graph based search space is built and an evolutionary search strategy is utilized to guide the searching approach. each searched architecture is trained by knowledge distillation on pre-trained language model and then evaluated under a robustness-, accuracy- and efficiency-aware metric as environmental fitness. experimental results show that rosearch can improve robustness of student models from 7%~18% up to 45.8%~47.8% on different datasets with comparable weight compression ratio to existing distillation methods (4.6$\times$~6.5$\times$ improvement from teacher model bert_base) and low accuracy drop. in addition, we summarize the relationship between student architecture and robustness through statistics of searched models.

2021-06-06

Mohit Kumar, Bernhard A. Moser, Lukas Fischer, Bernhard Freudenthaler
Abstract: in order to develop machine learning and deep learning models that take into account the guidelines and principles of trustworthy ai, a novel information theoretic trustworthy ai framework is introduced. a unified approach to "privacy-preserving interpretable and transferable learning" is considered for studying and optimizing the tradeoffs between privacy, interpretability, and transferability aspects. a variational membership-mapping bayesian model is used for the analytical approximations of the defined information theoretic measures for privacy-leakage, interpretability, and transferability. the approach consists of approximating the information theoretic measures via maximizing a lower-bound using variational optimization. the study presents a unified information theoretic approach to study different aspects of trustworthy ai in a rigorous analytical manner. the approach is demonstrated through numerous experiments on benchmark datasets and a real-world biomedical application concerned with the detection of mental stress on individuals using heart rate variability analysis.

2021-06-04

Tatiana Tommasi, Silvia Bucci, Barbara Caputo, Pietro Asinari
Abstract: thanks to the great progress of machine learning in the last years, several artificial intelligence (ai) techniques have been increasingly moving from the controlled research laboratory settings to our everyday life. ai is clearly supportive in many decision-making scenarios, but when it comes to sensitive areas such as health care, hiring policies, education, banking or justice, with major impact on individuals and society, it becomes crucial to establish guidelines on how to design, develop, deploy and monitor this technology. indeed the decision rules elaborated by machine learning models are data-driven and there are multiple ways in which discriminatory biases can seep into data. algorithms trained on those data incur the risk of amplifying prejudices and societal stereotypes by over associating protected attributes such as gender, ethnicity or disabilities with the prediction task. starting from the extensive experience of the national metrology institute on measurement standards and certification roadmaps, and of politecnico di torino on machine learning as well as methods for domain bias evaluation and mastering, we propose a first joint effort to define the operational steps needed for ai fairness certification. specifically we will overview the criteria that should be met by an ai system before coming into official service and the conformity assessment procedures useful to monitor its functioning for fair decisions.
Parand Alizadeh Alamdari, Toryn Q. Klassen, Rodrigo Toro Icarte, Sheila A. Mcilraith
Abstract: recent work in ai safety has highlighted that in sequential decision making, objectives are often underspecified or incomplete. this gives discretion to the acting agent to realize the stated objective in ways that may result in undesirable outcomes. we contend that to learn to act safely, a reinforcement learning (rl) agent should include contemplation of the impact of its actions on the wellbeing and agency of others in the environment, including other acting agents and reactive processes. we endow rl agents with the ability to contemplate such impact by augmenting their reward based on expectation of future return by others in the environment, providing different criteria for characterizing impact. we further endow these agents with the ability to differentially factor this impact into their decision making, manifesting behavior that ranges from self-centred to self-less, as demonstrated by experiments in gridworld environments.

2021-06-03

Nirav Diwan, Tanmoy Chakravorty, Zubair Shafiq
Abstract: there are concerns that the ability of language models (lms) to generate high quality synthetic text can be misused to launch spam, disinformation, or propaganda. therefore, the research community is actively working on developing approaches to detect whether a given text is organic or synthetic. while this is a useful first step, it is important to be able to further fingerprint the author lm to attribute its origin. prior work on fingerprinting lms is limited to attributing synthetic text generated by a handful (usually < 10) of pre-trained lms. however, lms such as gpt2 are commonly fine-tuned in a myriad of ways (e.g., on a domain-specific text corpus) before being used to generate synthetic text. it is challenging to fingerprinting fine-tuned lms because the universe of fine-tuned lms is much larger in realistic scenarios. to address this challenge, we study the problem of large-scale fingerprinting of fine-tuned lms in the wild. using a real-world dataset of synthetic text generated by 108 different fine-tuned lms, we conduct comprehensive experiments to demonstrate the limitations of existing fingerprinting approaches. our results show that fine-tuning itself is the most effective in attributing the synthetic text generated by fine-tuned lms.
Elizabeth Excell, Noura Al Moubayed
Abstract: classifiers tend to propagate biases present in the data on which they are trained. hence, it is important to understand how the demographic identities of the annotators of comments affect the fairness of the resulting model. in this paper, we focus on the differences in the ways men and women annotate comments for toxicity, investigating how these differences result in models that amplify the opinions of male annotators. we find that the bert model as-sociates toxic comments containing offensive words with male annotators, causing the model to predict 67.7% of toxic comments as having been annotated by men. we show that this disparity between gender predictions can be mitigated by removing offensive words and highly toxic comments from the training data. we then apply the learned associations between gender and language to toxic language classifiers, finding that models trained exclusively on female-annotated data perform 1.8% better than those trained solely on male-annotated data and that training models on data after removing all offensive words reduces bias in the model by 55.5% while increasing the sensitivity by 0.4%.

2021-06-02

Paras Bhatt, Anthony Rios
Abstract: language generation models' democratization benefits many domains, from answering health-related questions to enhancing education by providing ai-driven tutoring services. however, language generation models' democratization also makes it easier to generate human-like text at-scale for nefarious activities, from spreading misinformation to targeting specific groups with hate speech. thus, it is essential to understand how people interact with bots and develop methods to detect bot-generated text. this paper shows that bot-generated text detection methods are more robust across datasets and models if we use information about how people respond to it rather than using the bot's text directly. we also analyze linguistic alignment, providing insight into differences between human-human and human-bot conversations.

2021-05-31

Mary Roszel, Robert Norvill, Jean Hilger, Radu State
Abstract: the widespread utilization of ai systems has drawn attention to the potential impacts of such systems on society. of particular concern are the consequences that prediction errors may have on real-world scenarios, and the trust humanity places in ai systems. it is necessary to understand how we can evaluate trustworthiness in ai and how individuals and entities alike can develop trustworthy ai systems. in this paper, we analyze each element of trustworthiness and provide a set of 20 guidelines that can be leveraged to ensure optimal ai functionality while taking into account the greater ethical, technical, and practical impacts to humanity. moreover, the guidelines help ensure that trustworthiness is provable and can be demonstrated, they are implementation agnostic, and they can be applied to any ai system in any sector.

2021-05-30

Carina Prunkl, Carolyn Ashurst, Markus Anderljung, Helena Webb, Jan Leike, Allan Dafoe
Abstract: turning principles into practice is one of the most pressing challenges of artificial intelligence (ai) governance. in this article, we reflect on a novel governance initiative by one of the world's largest ai conferences. in 2020, the conference on neural information processing systems (neurips) introduced a requirement for submitting authors to include a statement on the broader societal impacts of their research. drawing insights from similar governance initiatives, including institutional review boards (irbs) and impact requirements for funding applications, we investigate the risks, challenges and potential benefits of such an initiative. among the challenges, we list a lack of recognised best practice and procedural transparency, researcher opportunity costs, institutional and social pressures, cognitive biases, and the inherently difficult nature of the task. the potential benefits, on the other hand, include improved anticipation and identification of impacts, better communication with policy and governance experts, and a general strengthening of the norms around responsible research. to maximise the chance of success, we recommend measures to increase transparency, improve guidance, create incentives to engage earnestly with the process, and facilitate public deliberation on the requirement's merits and future. perhaps the most important contribution from this analysis are the insights we can gain regarding effective community-based governance and the role and responsibility of the ai research community more broadly.

2021-05-29

Lê-Nguyên Hoang, Louis Faucon, Aidan Jungo, Sergei Volodin, Dalia Papuc, Orfeas Liossatos, Ben Crulis, Mariame Tighanimine, Isabela Constantin, Anastasiia Kucherenko, Alexandre Maurer, Felix Grimberg, Vlad Nitu, Chris Vossen, Sébastien Rouault, El-Mahdi El-Mhamdi
Abstract: today's large-scale algorithms have become immensely influential, as they recommend and moderate the content that billions of humans are exposed to on a daily basis. they are the de-facto regulators of our societies' information diet, from shaping opinions on public health to organizing groups for social movements. this creates serious concerns, but also great opportunities to promote quality information. addressing the concerns and seizing the opportunities is a challenging, enormous and fabulous endeavor, as intuitively appealing ideas often come with unwanted {\it side effects}, and as it requires us to think about what we deeply prefer. understanding how today's large-scale algorithms are built is critical to determine what interventions will be most effective. given that these algorithms rely heavily on {\it machine learning}, we make the following key observation: \emph{any algorithm trained on uncontrolled data must not be trusted}. indeed, a malicious entity could take control over the data, poison it with dangerously manipulative fabricated inputs, and thereby make the trained algorithm extremely unsafe. we thus argue that the first step towards safe and ethical large-scale algorithms must be the collection of a large, secure and trustworthy dataset of reliable human judgments. to achieve this, we introduce \emph{tournesol}, an open source platform available at \url{https://tournesol.app}. tournesol aims to collect a large database of human judgments on what algorithms ought to widely recommend (and what they ought to stop widely recommending). we outline the structure of the tournesol database, the key features of the tournesol platform and the main hurdles that must be overcome to make it a successful project. most importantly, we argue that, if successful, tournesol may then serve as the essential foundation for any safe and ethical large-scale algorithm.

2021-05-25

Joymallya Chakraborty, Suvodeep Majumder, Tim Menzies
Abstract: increasingly, software is making autonomous decisions in case of criminal sentencing, approving credit cards, hiring employees, and so on. some of these decisions show bias and adversely affect certain social groups (e.g. those defined by sex, race, age, marital status). many prior works on bias mitigation take the following form: change the data or learners in multiple ways, then see if any of that improves fairness. perhaps a better approach is to postulate root causes of bias and then applying some resolution strategy. this paper postulates that the root causes of bias are the prior decisions that affect- (a) what data was selected and (b) the labels assigned to those examples. our fair-smote algorithm removes biased labels; and rebalances internal distributions such that based on sensitive attribute, examples are equal in both positive and negative classes. on testing, it was seen that this method was just as effective at reducing bias as prior approaches. further, models generated via fair-smote achieve higher performance (measured in terms of recall and f1) than other state-of-the-art fairness improvement algorithms. to the best of our knowledge, measured in terms of number of analyzed learners and datasets, this study is one of the largest studies on bias mitigation yet presented in the literature.

2021-05-19

Daryna Dementieva, Daniil Moskovskiy, Varvara Logacheva, David Dale, Olga Kozlova, Nikita Semenov, Alexander Panchenko
Abstract: we introduce the first study of automatic detoxification of russian texts to combat offensive language. such a kind of textual style transfer can be used, for instance, for processing toxic content in social media. while much work has been done for the english language in this field, it has never been solved for the russian language yet. we test two types of models - unsupervised approach based on bert architecture that performs local corrections and supervised approach based on pretrained language gpt-2 model - and compare them with several baselines. in addition, we describe evaluation setup providing training datasets and metrics for automatic evaluation. the results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.

2021-05-13

Iain Barclay, Alun Preece, Ian Taylor, Swapna K. Radha, Jarek Nabrzyski
Abstract: adopting shared data resources requires scientists to place trust in the originators of the data. when shared data is later used in the development of artificial intelligence (ai) systems or machine learning (ml) models, the trust lineage extends to the users of the system, typically practitioners in fields such as healthcare and finance. practitioners rely on ai developers to have used relevant, trustworthy data, but may have limited insight and recourse. this paper introduces a software architecture and implementation of a system based on design patterns from the field of self-sovereign identity. scientists can issue signed credentials attesting to qualities of their data resources. data contributions to ml models are recorded in a bill of materials (bom), which is stored with the model as a verifiable credential. the bom provides a traceable record of the supply chain for an ai system, which facilitates on-going scrutiny of the qualities of the contributing components. the verified bom, and its linkage to certified data qualities, is used in the ai scrutineer, a web-based tool designed to offer practitioners insight into ml model constituents and highlight any problems with adopted datasets, should they be found to have biased data or be otherwise discredited.
Nathan Dolbir, Triyasha Dastidar, Kaushik Roy
Abstract: ai chatbots have made vast strides in technology improvement in recent years and are already operational in many industries. advanced natural language processing techniques, based on deep networks, efficiently process user requests to carry out their functions. as chatbots gain traction, their applicability in healthcare is an attractive proposition due to the reduced economic and people costs of an overburdened system. however, healthcare bots require safe and medically accurate information capture, which deep networks aren't yet capable of due to user text and speech variations. knowledge in symbolic structures is more suited for accurate reasoning but cannot handle natural language processing directly. thus, in this paper, we study the effects of combining knowledge and neural representations on chatbot safety, accuracy, and understanding.
Edmon Begoli, Robert A. Bridges, Sean Oesch, Kathryn E. Knight
Abstract: policy-mandated, rigorously administered scientific testing is needed to provide transparency into the efficacy of artificial intelligence-based (ai-based) cyber defense tools for consumers and to prioritize future research and development. in this article, we propose a model that is informed by our experience, urged forward by massive scale cyberattacks, and inspired by parallel developments in the biomedical field and the unprecedentedly fast development of new vaccines to combat global pathogens.
Nengfeng Zhou, Zach Zhang, Vijayan N. Nair, Harsh Singhal, Jie Chen, Agus Sudjianto
Abstract: the advent of ai and ml algorithms has led to opportunities as well as challenges. in this paper, we provide an overview of bias and fairness issues that arise with the use of ml algorithms. we describe the types and sources of data bias, and discuss the nature of algorithmic unfairness. this is followed by a review of fairness metrics in the literature, discussion of their limitations, and a description of de-biasing (or mitigation) techniques in the model life cycle.

2021-05-09

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng
Abstract: technology for language generation has advanced rapidly, spurred by advancements in pre-training large models on massive amounts of data and the need for intelligent agents to communicate in a natural manner. while techniques can effectively generate fluent text, they can also produce undesirable societal biases that can have a disproportionately negative impact on marginalized populations. language generation presents unique challenges for biases in terms of direct user interaction and the structure of decoding techniques. to better understand these challenges, we present a survey on societal biases in language generation, focusing on how data and techniques contribute to biases and progress towards reducing biases. motivated by a lack of studies on biases from decoding techniques, we also conduct experiments to quantify the effects of these techniques. by further discussing general trends and open challenges, we call to attention promising directions for research and the importance of fairness and inclusivity considerations for language generation applications.

2021-05-06

Samson Tan, Shafiq Joty, Kathy Baxter, Araz Taeihagh, Gregory A. Bennett, Min-Yen Kan
Abstract: questions of fairness, robustness, and transparency are paramount to address before deploying nlp systems. central to these concerns is the question of reliability: can nlp systems reliably treat different demographics fairly and function correctly in diverse and noisy environments? to address this, we argue for the need for reliability testing and contextualize it among existing work on improving accountability. we show how adversarial attacks can be reframed for this goal, via a framework for developing reliability tests. we argue that reliability testing -- with an emphasis on interdisciplinary collaboration -- will enable rigorous and targeted testing, and aid in the enactment and enforcement of industry standards.
Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, Yejin Choi
Abstract: despite recent advances in natural language generation, it remains challenging to control attributes of generated text. we propose dexperts: decoding-time experts, a decoding-time method for controlled text generation that combines a pretrained language model with "expert" lms and/or "anti-expert" lms in a product of experts. intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts, and unlikely by the anti-experts. we apply dexperts to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. moreover, because dexperts operates only on the output of the pretrained lm, it is effective with (anti-)experts of smaller size, including when operating on gpt-3. our work highlights the promise of tuning small lms on text with (un)desirable attributes for efficient decoding-time steering.

2021-05-05

Baobao Zhang, Markus Anderljung, Lauren Kahn, Noemi Dreksler, Michael C. Horowitz, Allan Dafoe
Abstract: machine learning (ml) and artificial intelligence (ai) researchers play an important role in the ethics and governance of ai, including taking action against what they perceive to be unethical uses of ai (belfield, 2020; van noorden, 2020). nevertheless, this influential group's attitudes are not well understood, which undermines our ability to discern consensuses or disagreements between ai/ml researchers. to examine these researchers' views, we conducted a survey of those who published in the top ai/ml conferences (n = 524). we compare these results with those from a 2016 survey of ai/ml researchers (grace, salvatier, dafoe, zhang, & evans, 2018) and a 2018 survey of the us public (zhang & dafoe, 2020). we find that ai/ml researchers place high levels of trust in international organizations and scientific organizations to shape the development and use of ai in the public interest; moderate trust in most western tech companies; and low trust in national militaries, chinese tech companies, and facebook. while the respondents were overwhelmingly opposed to ai/ml researchers working on lethal autonomous weapons, they are less opposed to researchers working on other military applications of ai, particularly logistics algorithms. a strong majority of respondents think that ai safety research should be prioritized and that ml institutions should conduct pre-publication review to assess potential harms. being closer to the technology itself, ai/ml re-searchers are well placed to highlight new risks and develop technical solutions, so this novel attempt to measure their attitudes has broad relevance. the findings should help to improve how researchers, private sector executives, and policymakers think about regulations, governance frameworks, guiding principles, and national and international governance strategies for ai.

2021-05-03

Simon Enni, Ira Assent
Abstract: the influence of machine learning (ml) is quickly spreading, and a number of recent technological innovations have applied ml as a central technology. however, ml development still requires a substantial amount of human expertise to be successful. the deliberation and expert judgment applied during ml development cannot be revisited or scrutinized if not properly documented, and this hinders the further adoption of ml technologies--especially in safety critical situations. in this paper, we present a method consisting of eight design questions, that outline the deliberation and normative choices going into creating a ml model. our method affords several benefits, such as supporting critical assessment through methodological transparency, aiding in model debugging, and anchoring model explanations by committing to a pre hoc expectation of the model's behavior. we believe that our method can help ml practitioners structure and justify their choices and assumptions when developing ml models, and that it can help bridge a gap between those inside and outside the ml field in understanding how and why ml models are designed and developed the way they are.

2021-05-02

Roman V. Yampolskiy
Abstract: in this work, we survey skepticism regarding ai risk and show parallels with other types of scientific skepticism. we start by classifying different types of ai risk skepticism and analyze their root causes. we conclude by suggesting some intervention approaches, which may be successful in reducing ai risk skepticism, at least amongst artificial intelligence researchers.

2021-05-01

Shaofeng Li, Hui Liu, Tian Dong, Benjamin Zi Hao Zhao, Minhui Xue, Haojin Zhu, Jialiang Lu
Abstract: natural language processing (nlp) systems have been proven to be vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and may only be activated by specific inputs (called triggers), to trick the model into producing unexpected behaviors. in this paper, we create covert and natural triggers for textual backdoor attacks, \textit{hidden backdoors}, where triggers can fool both modern language models and human inspection. we deploy our hidden backdoors through two state-of-the-art trigger embedding methods. the first approach via homograph replacement, embeds the trigger into deep neural networks through the visual spoofing of lookalike character replacement. the second approach uses subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. we demonstrate that the proposed hidden backdoors can be effective across three downstream security-critical nlp tasks, representative of modern human-centric nlp systems, including toxic comment detection, neural machine translation (nmt), and question answering (qa). our two hidden backdoor attacks can achieve an attack success rate (asr) of at least $97\%$ with an injection rate of only $3\%$ in toxic comment detection, $95.1\%$ asr in nmt with less than $0.5\%$ injected data, and finally $91.12\%$ asr against qa updated with only 27 poisoning data samples on a model previously trained with 92,024 samples (0.029\%). we are able to demonstrate the adversary's high success rate of attacks, while maintaining functionality for regular users, with triggers inconspicuous by the human administrators.

2021-04-30

Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, Soroush Vosoughi
Abstract: current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. in this paper, we describe metrics for measuring political bias in gpt-2 generation and propose a reinforcement learning (rl) framework for mitigating political biases in generated text. by using rewards from word embeddings or a classifier, our rl framework guides debiased generation without having access to the training data or requiring the model to be retrained. in empirical experiments on three attributes sensitive to political bias (gender, location, and topic), our methods reduced bias according to both our metrics and human evaluation, while maintaining readability and semantic coherence.
Jakob Mokander, Luciano Floridi
Abstract: a series of recent developments points towards auditing as a promising mechanism to bridge the gap between principles and practice in ai ethics. building on ongoing discussions concerning ethics-based auditing, we offer three contributions. first, we argue that ethics-based auditing can improve the quality of decision making, increase user satisfaction, unlock growth potential, enable law-making, and relieve human suffering. second, we highlight current best practices to support the design and implementation of ethics-based auditing: to be feasible and effective, ethics-based auditing should take the form of a continuous and constructive process, approach ethical alignment from a system perspective, and be aligned with public policies and incentives for ethically desirable behaviour. third, we identify and discuss the constraints associated with ethics-based auditing. only by understanding and accounting for these constraints can ethics-based auditing facilitate ethical alignment of ai, while enabling society to reap the full economic and social benefits of automation.

2021-04-29

Sebastian Houben, Stephanie Abrecht, Maram Akila, Andreas Bär, Felix Brockherde, Patrick Feifel, Tim Fingscheidt, Sujan Sai Gannamaneni, Seyed Eghbal Ghobadi, Ahmed Hammam, Anselm Haselhoff, Felix Hauser, Christian Heinzemann, Marco Hoffmann, Nikhil Kapoor, Falk Kappel, Marvin Klingner, Jan Kronenberger, Fabian Küppers, Jonas Löhdefink, Michael Mlynarski, Michael Mock, Firas Mualla, Svetlana Pavlitskaya, Maximilian Poretschkin, Alexander Pohl, Varun Ravi-Kumar, Julia Rosenzweig, Matthias Rottmann, Stefan Rüping, Timo Sämann, Jan David Schneider, Elena Schulz, Gesina Schwalbe, Joachim Sicking, Toshika Srivastava, Serin Varghese, Michael Weber, Sebastian Wirkert, Tim Wirtz, Matthias Woehrle
Abstract: the use of deep neural networks (dnns) in safety-critical applications like mobile health and autonomous driving is challenging due to numerous model-inherent shortcomings. these shortcomings are diverse and range from a lack of generalization over insufficient interpretability to problems with malicious inputs. cyber-physical systems employing dnns are therefore likely to suffer from safety concerns. in recent years, a zoo of state-of-the-art techniques aiming to address these safety concerns has emerged. this work provides a structured and broad overview of them. we first identify categories of insufficiencies to then describe research activities aiming at their detection, quantification, or mitigation. our paper addresses both machine learning experts and safety engineers: the former ones might profit from the broad range of machine learning topics covered and discussions on limitations of recent methods. the latter ones might gain insights into the specifics of modern ml methods. we moreover hope that our contribution fuels discussions on desiderata for ml systems and strategies on how to propel existing approaches accordingly.

2021-04-26

Rob Geada, Tommaso Teofili, Rui Vieira, Rebecca Whitworth, Daniele Zonca
Abstract: artificial intelligence (ai) is becoming increasingly more popular and can be found in workplaces and homes around the world. the decisions made by such "black box" systems are often opaque; that is, so complex as to be functionally impossible to understand. how do we ensure that these systems are behaving as desired? trustyai is an initiative which looks into explainable artificial intelligence (xai) solutions to address this issue of explainability in the context of both ai models and decision services. this paper presents the trustyai explainability toolkit, a java and python library that provides xai explanations of decision services and predictive models for both enterprise and data science use-cases. we describe the trustyai implementations and extensions to techniques such as lime, shap and counterfactuals, which are benchmarked against existing implementations in a variety of experiments.

2021-04-23

N/A Anjum, Rahul Katarya
Abstract: social media and the internet have become an integral part of how people spread and consume information. over a period of time, social media evolved dramatically, and almost half of the population is using social media to express their views and opinions. online hate speech is one of the drawbacks of social media nowadays, which needs to be controlled. in this paper, we will understand how hate speech originated and what are the consequences of it; trends of machine-learning algorithms to solve an online hate speech problem. this study contributes by providing a systematic approach to help researchers to identify a new research direction and elucidating the shortcomings of the studies and model, as well as providing future directions to advance the field.

2021-04-22

Robert M. Williams, Roman V. Yampolskiy
Abstract: as ai technologies increase in capability and ubiquity, ai accidents are becoming more common. based on normal accident theory, high reliability theory, and open systems theory, we create a framework for understanding the risks associated with ai applications. in addition, we also use ai safety principles to quantify the unique risks of increased intelligence and human-like qualities in ai. together, these two fields give a more complete picture of the risks of contemporary ai. by focusing on system properties near accidents instead of seeking a root cause of accidents, we identify where attention should be paid to safety for current generation ai systems.

2021-04-20

Neil Dhir, Henrique Hoeltgebaum, Niall Adams, Mark Briers, Anthony Burke, Paul Jones
Abstract: cybercriminals are rapidly developing new malicious tools that leverage artificial intelligence (ai) to enable new classes of adaptive and stealthy attacks. new defensive methods need to be developed to counter these threats. some cybersecurity professionals are speculating ai will enable corresponding new classes of active cyber defence measures -- is this realistic, or currently mostly hype? the alan turing institute, with expert guidance from the uk national cyber security centre and defence science technology laboratory, published a research roadmap for ai for acd last year. this position paper updates the roadmap for two of the most promising ai approaches -- reinforcement learning and causal inference - and describes why they could help tip the balance back towards defenders.

2021-04-19

Md Sultan Al Nahian, Spencer Frazier, Brent Harrison, Mark Riedl
Abstract: as more machine learning agents interact with humans, it is increasingly a prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm. value alignment is a property of intelligent agents wherein they solely pursue non-harmful behaviors or human-beneficial goals. we introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward, plus a normative behavior reward. the normative behavior reward is derived from a value-aligned prior model previously shown to classify text as normative or non-normative. we show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative. we test our value-alignment technique on three interactive text-based worlds; each world is designed specifically to challenge agents with a task as well as provide opportunities to deviate from the task to engage in normative and/or altruistic behavior.

2021-04-18

Emily Sheng, Josh Arnold, Zhou Yu, Kai-Wei Chang, Nanyun Peng
Abstract: dialogue systems in the form of chatbots and personal assistants are being increasingly integrated into people's lives. modern dialogue systems may consider adopting anthropomorphic personas, mimicking societal demographic groups to appear more approachable and trustworthy to users. however, the adoption of a persona can result in the adoption of biases. in this paper, we present the first large-scale study on persona biases in dialogue systems and conduct analyses on personas of different social classes, sexual orientations, races, and genders. we define persona biases as harmful differences in responses (e.g., varying levels of offensiveness, agreement with harmful statements) generated from adopting different demographic personas. furthermore, we introduce an open-source framework, unitpersonabias, to explore and aggregate persona biases in dialogue systems. by analyzing the blender and dialogpt dialogue systems, we observe that adopting personas can actually decrease harmful responses, compared to not using any personas. additionally, we find that persona choices can affect the degree of harms in generated responses and thus should be systematically evaluated before deployment. we also analyze how personas can result in different amounts of harm towards specific demographics.

2021-04-17

Tejas Srinivasan, Yonatan Bisk
Abstract: numerous works have analyzed biases in vision and pre-trained language models individually - however, less attention has been paid to how these biases interact in multimodal settings. this work extends text-based bias analysis methods to investigate multimodal language models, and analyzes intra- and inter-modality associations and biases learned by these models. specifically, we demonstrate that vl-bert (su et al., 2020) exhibits gender biases, often preferring to reinforce a stereotype over faithfully describing the visual scene. we demonstrate these findings on a controlled case-study and extend them for a larger set of stereotypically gendered entities.

2021-04-16

Xiang Gao, Yizhe Zhang, Michel Galley, Bill Dolan
Abstract: the design of better automated dialogue evaluation metrics offers the potential of accelerate evaluation research on conversational ai. however, existing trainable dialogue evaluation models are generally restricted to classifiers trained in a purely supervised manner, which suffer a significant risk from adversarial attacking (e.g., a nonsensical response that enjoys a high classification score). to alleviate this risk, we propose an adversarial training approach to learn a robust model, att (adversarial turing test), that discriminates machine-generated responses from human-written replies. in contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples using reinforcement learning. the key benefit of this unrestricted adversarial training approach is allowing the discriminator to improve robustness in an iterative attack-defense game. our discriminator shows high accuracy on strong attackers including dialogpt and gpt-3.
Abhyuday Jagannatha, Bhanu Pratap Singh Rawat, Hong Yu
Abstract: deep neural network (dnn) models have been shown to have high empirical privacy leakages. clinical language models (clms) trained on clinical data have been used to improve performance in biomedical natural language processing tasks. in this work, we investigate the risks of training-data leakage through white-box or black-box access to clms. we design and employ membership inference attacks to estimate the empirical privacy leaks for model architectures like bert and gpt2. we show that membership inference attacks on clms lead to non-trivial privacy leakages of up to 7%. our results show that smaller models have lower empirical privacy leakages than larger ones, and masked lms have lower leakages than auto-regressive lms. we further show that differentially private clms can have improved model utility on clinical domain while ensuring low empirical privacy leakage. lastly, we also study the effects of group-level membership inference and disease rarity on clm privacy leakages.

2021-04-15

Masahiro Kaneko, Danushka Bollegala
Abstract: masked language models (mlms) have shown superior performances in numerous downstream nlp tasks when used as text encoders. unfortunately, mlms also demonstrate significantly worrying levels of social biases. we show that the previously proposed evaluation metrics for quantifying the social biases in mlms are problematic due to following reasons: (1) prediction accuracy of the masked tokens itself tend to be low in some mlms, which raises questions regarding the reliability of the evaluation metrics that use the (pseudo) likelihood of the predicted tokens, and (2) the correlation between the prediction accuracy of the mask and the performance in downstream nlp tasks is not taken into consideration, and (3) high frequency words in the training data are masked more often, introducing noise due to this selection bias in the test cases. to overcome the above-mentioned disfluencies, we propose all unmasked likelihood (aul), a bias evaluation measure that predicts all tokens in a test case given the mlm embedding of the unmasked input. we find that aul accurately detects different types of biases in mlms. we also propose aul with attention weights (aula) to evaluate tokens based on their importance in a sentence. however, unlike aul and aula, previously proposed bias evaluation measures for mlms systematically overestimate the measured biases, and are heavily influenced by the unmasked tokens in the context.
Karolina Stańczak, Sagnik Ray Choudhury, Tiago Pimentel, Ryan Cotterell, Isabelle Augenstein
Abstract: while the prevalence of large pre-trained language models has led to significant improvements in the performance of nlp systems, recent research has demonstrated that these models inherit societal biases extant in natural language. in this paper, we explore a simple method to probe pre-trained language models for gender bias, which we use to effect a multi-lingual study of gender bias towards politicians. we construct a dataset of 250k politicians from most countries in the world and quantify adjective and verb usage around those politicians' names as a function of their gender. we conduct our study in 7 languages across 6 different language modeling architectures. our results demonstrate that stance towards politicians in pre-trained language models is highly dependent on the language used. finally, contrary to previous findings, our study suggests that larger language models do not tend to be significantly more gender-biased than smaller ones.
Eric Lehman, Sarthak Jain, Karl Pichotta, Yoav Goldberg, Byron C. Wallace
Abstract: large transformers pretrained over clinical notes from electronic health records (ehr) have afforded substantial gains in performance on predictive clinical tasks. the cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as clinicalbert. while most efforts have used deidentified ehr, many researchers have access to large sets of sensitive, non-deidentified ehr with which they might train a bert model (or similar). would it be safe to release the weights of such a model if they did? in this work, we design a battery of approaches intended to recover personal health information (phi) from a trained bert. specifically, we attempt to recover patient names and conditions with which they are associated. we find that simple probing methods are not able to meaningfully extract sensitive information from bert trained over the mimic-iii corpus of ehr. however, more sophisticated "attacks" may succeed in doing so: to facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release

2021-04-14

Haswanth Aekula, Sugam Garg, Animesh Gupta
Abstract: despite widespread use in natural language processing (nlp) tasks, word embeddings have been criticized for inheriting unintended gender bias from training corpora. programmer is more closely associated with man and homemaker is more closely associated with woman. such gender bias has also been shown to propagate in downstream tasks.

2021-04-13

Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, Dan Klein
Abstract: language models (lms) must be both safe and equitable to be responsibly deployed in practice. with safety in mind, numerous detoxification techniques (e.g., dathathri et al. 2020; krause et al. 2020) have been proposed to mitigate toxic lm generations. in this work, we show that current detoxification techniques hurt equity: they decrease the utility of lms on language used by marginalized groups (e.g., african-american english and minority identity mentions). in particular, we perform automatic and human evaluations of text generation quality when lms are conditioned on inputs with different dialects and group identifiers. we find that detoxification makes lms more brittle to distribution shift, especially on language used by marginalized groups. we identify that these failures stem from detoxification methods exploiting spurious correlations in toxicity datasets. overall, our results highlight the tension between the controllability and distributional robustness of lms.

2021-04-09

Hendrik Heuer
Abstract: at the latest since the advent of the internet, disinformation and conspiracy theories have become ubiquitous. recent examples like qanon and pizzagate prove that false information can lead to real violence. in this motivation statement for the workshop on human aspects of misinformation at chi 2021, i explain my research agenda focused on 1. why people believe in disinformation, 2. how people can be best supported in recognizing disinformation, and 3. what the potentials and risks of different tools designed to fight disinformation are.
Lambert Hogenhout
Abstract: this paper aims to provide an overview of the ethical concerns in artificial intelligence (ai) and the framework that is needed to mitigate those risks, and to suggest a practical path to ensure the development and use of ai at the united nations (un) aligns with our ethical values. the overview discusses how ai is an increasingly powerful tool with potential for good, albeit one with a high risk of negative side-effects that go against fundamental human rights and un values. it explains the need for ethical principles for ai aligned with principles for data governance, as data and ai are tightly interwoven. it explores different ethical frameworks that exist and tools such as assessment lists. it recommends that the un develop a framework consisting of ethical principles, architectural standards, assessment methods, tools and methodologies, and a policy to govern the implementation and adherence to this framework, accompanied by an education program for staff.

2021-04-08

Yakoob Khan, Weicheng Ma, Soroush Vosoughi
Abstract: this paper describes our approach to the toxic spans detection problem (semeval-2021 task 5). we propose bertoxic, a system that fine-tunes a pre-trained bert model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the boundaries. the post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assigning a toxic label to words that have at least one token labeled as toxic. through experiments, we show that these two post-processing steps improve the performance of our model by 4.16% on the test set. we also studied the effects of data augmentation and ensemble modeling strategies on our system. our system significantly outperformed the provided baseline and achieved an f1-score of 0.683, placing lone pine in the 17th place out of 91 teams in the competition. our code is made available at https://github.com/yakoob-khan/toxic-spans-detection
The Anh Han, Tom Lenaerts, Francisco C. Santos, Luis Moniz Pereira
Abstract: with the introduction of artificial intelligence (ai) and related technologies in our daily lives, fear and anxiety about their misuse as well as the hidden biases in their creation have led to a demand for regulation to address such issues. yet blindly regulating an innovation process that is not well understood, may stifle this process and reduce benefits that society may gain from the generated technology, even under the best intentions. in this paper, starting from a baseline model that captures the fundamental dynamics of a race for domain supremacy using ai technology, we demonstrate how socially unwanted outcomes may be produced when sanctioning is applied unconditionally to risk-taking, i.e. potentially unsafe, behaviours. as an alternative to resolve the detrimental effect of over-regulation, we propose a voluntary commitment approach wherein technologists have the freedom of choice between independently pursuing their course of actions or establishing binding agreements to act safely, with sanctioning of those that do not abide to what they pledged. overall, this work reveals for the first time how voluntary commitments, with sanctions either by peers or an institution, leads to socially beneficial outcomes in all scenarios envisageable in a short-term race towards domain supremacy through ai technology. these results are directly relevant for the design of governance and regulatory policies that aim to ensure an ethical and responsible ai technology development process.

2021-04-05

Olawale Onabola, Zhuang Ma, Yang Xie, Benjamin Akera, Abdulrahman Ibraheem, Jia Xue, Dianbo Liu, Yoshua Bengio
Abstract: subtle and overt racism is still present both in physical and online communities today and has impacted many lives in different segments of the society. in this short piece of work, we present how we're tackling this societal issue with natural language processing. we are releasing biascorp, a dataset containing 139,090 comments and news segment from three specific sources - fox news, breitbartnews and youtube. the first batch (45,000 manually annotated) is ready for publication. we are currently in the final phase of manually labeling the remaining dataset using amazon mechanical turk. bert has been used widely in several downstream tasks. in this work, we present hbert, where we modify certain layers of the pretrained bert model with the new hopfield layer. hbert generalizes well across different distributions with the added advantage of a reduced model complexity. we are also releasing a javascript library and a chrome extension application, to help developers make use of our trained model in web applications (say chat application) and for users to identify and report racially biased contents on the web respectively.

2021-04-02

Aishwarya Gupta, Avik Pal, Bholeshwar Khurana, Lakshay Tyagi, Ashutosh Modi
Abstract: humor and offense are highly subjective due to multiple word senses, cultural knowledge, and pragmatic competence. hence, accurately detecting humorous and offensive texts has several compelling use cases in recommendation systems and personalized content moderation. however, due to the lack of an extensive labeled dataset, most prior works in this domain haven't explored large neural models for subjective humor understanding. this paper explores whether large neural models and their ensembles can capture the intricacies associated with humor/offense detection and rating. our experiments on the semeval-2021 task 7: hahackathon show that we can develop reasonable humor and offense detection systems with such models. our models are ranked third in subtask 1b and consistently ranked around the top 33% of the leaderboard for the remaining subtasks.

2021-04-01

Nayeon Lee, Yejin Bang, Andrea Madotto, Pascale Fung
Abstract: media bias can lead to increased political polarization, and thus, the need for automatic mitigation methods is growing. existing mitigation work displays articles from multiple news outlets to provide diverse news coverage, but without neutralizing the bias inherent in each of the displayed articles. therefore, we propose a new task, a single neutralized article generation out of multiple biased articles, to facilitate more efficient access to balanced and unbiased information. in this paper, we compile a new dataset neuws, define an automatic evaluation metric, and provide baselines and multiple analyses to serve as a solid starting point for the proposed task. lastly, we obtain a human evaluation to demonstrate the alignment between our metric and human judgment.

2021-03-31

Philip Matthias Winter, Sebastian Eder, Johannes Weissenböck, Christoph Schwald, Thomas Doms, Tom Vogt, Sepp Hochreiter, Bernhard Nessler
Abstract: artificial intelligence is one of the fastest growing technologies of the 21st century and accompanies us in our daily lives when interacting with technical applications. however, reliance on such technical systems is crucial for their widespread applicability and acceptance. the societal tools to express reliance are usually formalized by lawful regulations, i.e., standards, norms, accreditations, and certificates. therefore, the t\"uv austria group in cooperation with the institute for machine learning at the johannes kepler university linz, proposes a certification process and an audit catalog for machine learning applications. we are convinced that our approach can serve as the foundation for the certification of applications that use machine learning and deep learning, the techniques that drive the current revolution in artificial intelligence. while certain high-risk areas, such as fully autonomous robots in workspaces shared with humans, are still some time away from certification, we aim to cover low-risk applications with our certification procedure. our holistic approach attempts to analyze machine learning applications from multiple perspectives to evaluate and verify the aspects of secure software development, functional requirements, data quality, data protection, and ethics. inspired by existing work, we introduce four criticality levels to map the criticality of a machine learning application regarding the impact of its decisions on people, environment, and organizations. currently, the audit catalog can be applied to low-risk applications within the scope of supervised learning as commonly encountered in industry. guided by field experience, scientific developments, and market demands, the audit catalog will be extended and modified accordingly.
Kalia Orphanou, Jahna Otterbacher, Styliani Kleanthous, Khuyagbaatar Batsuren, Fausto Giunchiglia, Veronika Bogina, Avital Shulner Tal, N/A Alanhartman, Tsvi Kuflik
Abstract: mitigating bias in algorithmic systems is a critical issue drawing attention across communities within the information and computer sciences. given the complexity of the problem and the involvement of multiple stakeholders -- including developers, end-users, and third parties -- there is a need to understand the landscape of the sources of bias, and the solutions being proposed to address them, from a broad, cross-domain perspective. this survey provides a "fish-eye view," examining approaches across four areas of research. the literature describes three steps toward a comprehensive treatment -- bias detection, fairness management and explainability management -- and underscores the need to work from within the system as well as from the perspective of stakeholders in the broader context.

2021-03-26

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, Geoffrey Irving
Abstract: for artificial intelligence to be beneficial to humans the behaviour of ai agents needs to be aligned with what humans want. in this paper we discuss some behavioural issues for language agents, arising from accidental misspecification by the system designer. we highlight some ways that misspecification can occur and discuss some behavioural issues that could arise from misspecification, including deceptive or manipulative language, and review some approaches for avoiding these issues.

2021-03-24

Michele Colledanchise
Abstract: robots applications in our daily life increase at an unprecedented pace. as robots will soon operate "out in the wild", we must identify the safety and security vulnerabilities they will face. robotics researchers and manufacturers focus their attention on new, cheaper, and more reliable applications. still, they often disregard the operability in adversarial environments where a trusted or untrusted user can jeopardize or even alter the robot's task. in this paper, we identify a new paradigm of security threats in the next generation of robots. these threats fall beyond the known hardware or network-based ones, and we must find new solutions to address them. these new threats include malicious use of the robot's privileged access, tampering with the robot sensors system, and tricking the robot's deliberation into harmful behaviors. we provide a taxonomy of attacks that exploit these vulnerabilities with realistic examples, and we outline effective countermeasures to prevent better, detect, and mitigate them.

2021-03-23

Ke-Li Chiu, Annie Collins, Rohan Alexander
Abstract: sophisticated language models such as openai's gpt-3 can generate hateful text that targets marginalized groups. given this capacity, we are interested in whether large language models can be used to identify hate speech and classify text as sexist or racist. we use gpt-3 to identify sexist and racist text passages with zero-, one-, and few-shot learning. we find that with zero- and one-shot learning, gpt-3 can identify sexist or racist text with an average accuracy between 55 per cent and 67 per cent, depending on the category of text and type of learning. with few-shot learning, the model's accuracy can be as high as 85 per cent. large language models have a role to play in hate speech detection, and with further development they could eventually be used to counter hate speech.

2021-03-21

Ninareh Mehrabi, Pei Zhou, Fred Morstatter, Jay Pujara, Xiang Ren, Aram Galstyan
Abstract: warning: this paper contains content that may be offensive or upsetting. numerous natural language processing models have tried injecting commonsense by using the conceptnet knowledge base to improve performance on different tasks. conceptnet, however, is mostly crowdsourced from humans and may reflect human biases such as "lawyers are dishonest." it is important that these biases are not conflated with the notion of commonsense. we study this missing yet important problem by first defining and quantifying biases in conceptnet as two types of representational harms: overgeneralization of polarized perceptions and representation disparity. we find that conceptnet contains severe biases and disparities across four demographic categories. in addition, we analyze two downstream models that use conceptnet as a source for commonsense knowledge and find the existence of biases in those models as well. we further propose a filtered-based bias-mitigation approach and examine its effectiveness. we show that our mitigation approach can reduce the issues in both resource and models but leads to a performance drop, leaving room for future work to build fairer and stronger commonsense models.

2021-03-19

Dian Yu, Zhou Yu, Kenji Sagae
Abstract: large language models benefit from training with a large amount of unlabeled text, which gives them increasingly fluent and diverse generation capabilities. however, using these models for text generation that takes into account target attributes, such as sentiment polarity or specific topics, remains a challenge. we propose a simple and flexible method for controlling text generation by aligning disentangled attribute representations. in contrast to recent efforts on training a discriminator to perturb the token level distribution for an attribute, we use the same data to learn an alignment function to guide the pre-trained, non-controlled language model to generate texts with the target attribute without changing the original language model parameters. we evaluate our method on sentiment- and topic-controlled generation, and show large performance gains over previous methods while retaining fluency and diversity.

2021-03-18

Huihan Yao, Ying Chen, Qinyuan Ye, Xisen Jin, Xiang Ren
Abstract: pre-trained language models have been successful on text classification tasks, but are prone to learning spurious correlations from biased datasets, and are thus vulnerable when making inferences in a new domain. prior work reveals such spurious patterns via post-hoc explanation algorithms which compute the importance of input features. further, the model is regularized to align the importance scores with human knowledge, so that the unintended model behaviors are eliminated. however, such a regularization technique lacks flexibility and coverage, since only importance scores towards a pre-defined list of features are adjusted, while more complex human knowledge such as feature interaction and pattern generalization can hardly be incorporated. in this work, we propose to refine a learned language model for a target domain by collecting human-provided compositional explanations regarding observed biases. by parsing these explanations into executable logic rules, the human-specified refinement advice from a small set of explanations can be generalized to more training examples. we additionally introduce a regularization term allowing adjustments for both importance and interaction of features to better rectify model behavior. we demonstrate the effectiveness of the proposed approach on two text classification tasks by showing improved performance in target domain as well as improved model fairness after refinement.
David A. Noever, Samantha E. Miller Noever
Abstract: with open ai's publishing of their clip model (contrastive language-image pre-training), multi-modal neural networks now provide accessible models that combine reading with visual recognition. their network offers novel ways to probe its dual abilities to read text while classifying visual objects. this paper demonstrates several new categories of adversarial attacks, spanning basic typographical, conceptual, and iconographic inputs generated to fool the model into making false or absurd classifications. we demonstrate that contradictory text and image signals can confuse the model into choosing false (visual) options. like previous authors, we show by example that the clip model tends to read first, look later, a phenomenon we describe as reading isn't believing.

2021-03-12

Matteo Camilli, Michael Felderer, Andrea Giusti, Dominik T. Matt, Anna Perini, Barbara Russo, Angelo Susi
Abstract: collaborative ai systems aim at working together with humans in a shared space to achieve a common goal. this setting imposes potentially hazardous circumstances due to contacts that could harm human beings. thus, building such systems with strong assurances of compliance with requirements domain specific standards and regulations is of greatest importance. challenges associated with the achievement of this goal become even more severe when such systems rely on machine learning components rather than such as top-down rule-based ai. in this paper, we introduce a risk modeling approach tailored to collaborative ai systems. the risk model includes goals, risk events and domain specific indicators that potentially expose humans to hazards. the risk model is then leveraged to drive assurance methods that feed in turn the risk model through insights extracted from run-time evidence. our envisioned approach is described by means of a running example in the domain of industry 4.0, where a robotic arm endowed with a visual perception component, implemented with machine learning, collaborates with a human operator for a production-relevant task.

2021-03-11

Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan
Abstract: framing involves the positive or negative presentation of an argument or issue depending on the audience and goal of the speaker (entman 1983). differences in lexical framing, the focus of our work, can have large effects on peoples' opinions and beliefs. to make progress towards reframing arguments for positive effects, we create a dataset and method for this task. we use a lexical resource for "connotations" to create a parallel corpus and propose a method for argument reframing that combines controllable text generation (positive connotation) with a post-decoding entailment component (same denotation). our results show that our method is effective compared to strong baselines along the dimensions of fluency, meaning, and trustworthiness/reduction of fear.

2021-03-10

Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, Duncan Wadsworth, Hanna Wallach
Abstract: disaggregated evaluations of ai systems, in which system performance is assessed and reported separately for different groups of people, are conceptually simple. however, their design involves a variety of choices. some of these choices influence the results that will be obtained, and thus the conclusions that can be drawn; others influence the impacts -- both beneficial and harmful -- that a disaggregated evaluation will have on people, including the people whose data is used to conduct the evaluation. we argue that a deeper understanding of these choices will enable researchers and practitioners to design careful and conclusive disaggregated evaluations. we also argue that better documentation of these choices, along with the underlying considerations and tradeoffs that have been made, will help others when interpreting an evaluation's results and conclusions.
Pengyu Cheng, Weituo Hao, Siyang Yuan, Shijing Si, Lawrence Carin
Abstract: pretrained text encoders, such as bert, have been applied increasingly in various natural language processing (nlp) tasks, and have recently demonstrated significant performance gains. however, recent studies have demonstrated the existence of social bias in these pretrained nlp models. although prior works have made progress on word-level debiasing, improved sentence-level fairness of pretrained encoders still lacks exploration. in this paper, we proposed the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (fairfil) network. to learn the fairfil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. on real-world datasets, our fairfil effectively reduces the bias degree of pretrained text encoders, while continuously showing desirable performance on downstream tasks. moreover, our post-hoc method does not require any retraining of the text encoders, further enlarging fairfil's application space.

2021-03-09

Nikolay Babakov, Varvara Logacheva, Olga Kozlova, Nikita Semenov, Alexander Panchenko
Abstract: not all topics are equally "flammable" in terms of toxicity: a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. we define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labeling a dataset for appropriateness. while toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. the core of inappropriateness is that it can harm the reputation of a speaker. this is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) inappropriate message is not toxic but still unacceptable. we collect and release two datasets for russian: a topic-labeled dataset and an appropriateness-labeled dataset. we also release pre-trained classification models trained on this data.
Tae Wan Kim, N/A Tong, N/A Lu, Kyusong Lee, Zhaoqi Cheng, Yanhan Tang, John Hooker
Abstract: conversational artificial intelligence (ai) used in industry settings can be trained to closely mimic human behaviors, including lying and deception. however, lying is often a necessary part of negotiation. to address this, we develop a normative framework for when it is ethical or unethical for a conversational ai to lie to humans, based on whether there is what we call "invitation of trust" in a particular scenario. importantly, cultural norms play an important role in determining whether there is invitation of trust across negotiation settings, and thus an ai trained in one culture may not be generalizable to others. moreover, individuals may have different expectations regarding the invitation of trust and propensity to lie for human vs. ai negotiators, and these expectations may vary across cultures as well. finally, we outline how a conversational chatbot can be trained to negotiate ethically by applying autoregressive models to large dialog and negotiations datasets.
Joshua R. Minot, Nicholas Cheney, Marc Maier, Danne C. Elbers, Christopher M. Danforth, Peter Sheridan Dodds
Abstract: medical systems in general, and patient treatment decisions and outcomes in particular, are affected by bias based on gender and other demographic elements. as language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. much of the work addressing this question has focused on biases encoded in language models -- statistical estimates of the relationships between concepts derived from distant reading of corpora. building on this work, we investigate how word choices made by healthcare practitioners and language models interact with regards to bias. we identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using bert-based gender classifiers. we show minimal degradation in health condition classification tasks for low- to medium-levels of bias removal via data augmentation. finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. this work outlines an interpretable approach for using data augmentation to identify and reduce the potential for bias in natural language processing pipelines.

2021-03-08

Jakob Schoeffer, Yvette Machowski, Niklas Kuehl
Abstract: automated decision systems are increasingly used for consequential decision making -- for a variety of reasons. these systems often rely on sophisticated yet opaque models, which do not (or hardly) allow for understanding how or why a given decision was arrived at. this is not only problematic from a legal perspective, but non-transparent systems are also prone to yield undesirable (e.g., unfair) outcomes because their sanity is difficult to assess and calibrate in the first place. in this work, we conduct a study to evaluate different attempts of explaining such systems with respect to their effect on people's perceptions of fairness and trustworthiness towards the underlying mechanisms. a pilot study revealed surprising qualitative insights as well as preliminary significant effects, which will have to be verified, extended and thoroughly discussed in the larger main study.
Patrick Schramowski, Cigdem Turan, Nico Andersen, Constantin A. Rothkopf, Kristian Kersting
Abstract: artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (lms) such as bert, its variants, gpt-2/3, and others. using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended state of the art for many nlp tasks and shown that they capture not only linguistic knowledge but also retain general knowledge implicitly present in the data. unfortunately, lms trained on unfiltered text corpora suffer from degenerated and biased behaviour. while this is well established, we show that recent lms also contain human-like biases of what is right and wrong to do, some form of ethical and moral norms of the society -- they bring a "moral direction" to surface. that is, we show that these norms can be captured geometrically by a direction, which can be computed, e.g., by a pca, in the embedding space, reflecting well the agreement of phrases to social norms implicitly expressed in the training texts and providing a path for attenuating or even preventing toxic degeneration in lms. being able to rate the (non-)normativity of arbitrary phrases without explicitly training the lm for this task, we demonstrate the capabilities of the "moral direction" for guiding (even other) lms towards producing normative text and showcase it on realtoxicityprompts testbed, preventing the neural toxic degeneration in gpt-2.

2021-03-05

Abir Rahali, Moulay A. Akhloufi
Abstract: in recent years we have witnessed an increase in cyber threats and malicious software attacks on different platforms with important consequences to persons and businesses. it has become critical to find automated machine learning techniques to proactively defend against malware. transformers, a category of attention-based deep learning techniques, have recently shown impressive results in solving different tasks mainly related to the field of natural language processing (nlp). in this paper, we propose the use of a transformers' architecture to automatically detect malicious software. we propose a model based on bert (bidirectional encoder representations from transformers) which performs a static analysis on the source code of android applications using preprocessed features to characterize existing malware and classify it into different representative malware categories. the obtained results are promising and show the high performance obtained by transformer-based models for malicious software detection.
Grégoire Déletang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom Mcgrath, Vladimir Mikulik, Markus Kunesch, Shane Legg, Pedro A. Ortega
Abstract: as machine learning systems become more powerful they also become increasingly unpredictable and opaque. yet, finding human-understandable explanations of how they work is essential for their safe deployment. this technical report illustrates a methodology for investigating the causal mechanisms that drive the behaviour of artificial agents. six use cases are covered, each addressing a typical question an analyst might ask about an agent. in particular, we show that each question cannot be addressed by pure observation alone, but instead requires conducting experiments with systematically chosen manipulations so as to generate the correct causal evidence.

2021-03-03

Josh Kalin, David Noever, Matthew Ciolino
Abstract: machine learning models present a risk of adversarial attack when deployed in production. quantifying the contributing factors and uncertainties using empirical measures could assist the industry with assessing the risk of downloading and deploying common model types. this work proposes modifying the traditional drake equation's formalism to estimate the number of potentially successful adversarial attacks on a deployed model. the drake equation is famously used for parameterizing uncertainties and it has been used in many research fields outside of its original intentions to estimate the number of radio-capable extra-terrestrial civilizations. while previous work has outlined methods for discovering vulnerabilities in public model architectures, the proposed equation seeks to provide a semi-quantitative benchmark for evaluating and estimating the potential risk factors for adversarial attacks.

2021-03-01

Tong Xiang, Sean Macavaney, Eugene Yang, Nazli Goharian
Abstract: despite the recent successes of transformer-based models in terms of effectiveness on a variety of tasks, their decisions often remain opaque to humans. explanations are particularly important for tasks like offensive language or toxicity detection on social media because a manual appeal process is often in place to dispute automatically flagged content. in this work, we propose a technique to improve the interpretability of these models, based on a simple and powerful assumption: a post is at least as toxic as its most toxic span. we incorporate this assumption into transformer models by scoring a post based on the maximum toxicity of its spans and augmenting the training process to identify correct spans. we find this approach effective and can produce explanations that exceed the quality of those provided by logistic regression analysis (often regarded as a highly-interpretable model), according to a human study.

2021-02-28

Timo Schick, Sahana Udupa, Hinrich Schütze
Abstract: when trained on large, unfiltered crawls from the internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: they often generate racist, sexist, violent or otherwise toxic language. as large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. in this paper, we first demonstrate a surprising finding: pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. we refer to this capability as self-diagnosis. based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. we refer to this approach as self-debiasing. self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model's parameters. while we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction.
Atoosa Kasirzadeh
Abstract: the societal and ethical implications of the use of opaque artificial intelligence systems for consequential decisions, such as welfare allocation and criminal justice, have generated a lively debate among multiple stakeholder groups, including computer scientists, ethicists, social scientists, policy makers, and end users. however, the lack of a common language or a multi-dimensional framework to appropriately bridge the technical, epistemic, and normative aspects of this debate prevents the discussion from being as productive as it could be. drawing on the philosophical literature on the nature and value of explanations, this paper offers a multi-faceted framework that brings more conceptual precision to the present debate by (1) identifying the types of explanations that are most pertinent to artificial intelligence predictions, (2) recognizing the relevance and importance of social and ethical values for the evaluation of these explanations, and (3) demonstrating the importance of these explanations for incorporating a diversified approach to improving the design of truthful algorithmic ecosystems. the proposed philosophical framework thus lays the groundwork for establishing a pertinent connection between the technical and ethical aspects of artificial intelligence systems.

2021-02-22

Henrietta Lyons, Eduardo Velloso, Tim Miller
Abstract: as the use of artificial intelligence (ai) in high-stakes decision-making increases, the ability to contest such decisions is being recognised in ai ethics guidelines as an important safeguard for individuals. yet, there is little guidance on how ai systems can be designed to support contestation. in this paper we explain that the design of a contestation process is important due to its impact on perceptions of fairness and satisfaction. we also consider design challenges, including a lack of transparency as well as the numerous design options that decision-making entities will be faced with. we argue for a human-centred approach to designing for contestability to ensure that the needs of decision subjects, and the community, are met.

2021-02-17

Wenjie Yin, Arkaitz Zubiaga
Abstract: hate speech is one type of harmful online content which directly attacks or promotes hate towards a group or an individual member based on their actual or perceived aspects of identity, such as ethnicity, religion, and sexual orientation. with online hate speech on the rise, its automatic detection as a natural language processing task is gaining increasing interest. however, it is only recently that it has been shown that existing models generalise poorly to unseen data. this survey paper attempts to summarise how generalisable existing hate speech detection models are, reason why hate speech models struggle to generalise, sums up existing attempts at addressing the main obstacles, and then proposes directions of future research to improve generalisation in hate speech detection.

2021-02-15

Margarita Leib, Nils C. Köbis, Rainer Michael Rilke, Marloes Hagens, Bernd Irlenbusch
Abstract: artificial intelligence (ai) is increasingly becoming a trusted advisor in people's lives. a new concern arises if ai persuades people to break ethical rules for profit. employing a large-scale behavioural experiment (n = 1,572), we test whether ai-generated advice can corrupt people. we further test whether transparency about ai presence, a commonly proposed policy, mitigates potential harm of ai-generated advice. using the natural language processing algorithm, gpt-2, we generated honesty-promoting and dishonesty-promoting advice. participants read one type of advice before engaging in a task in which they could lie for profit. testing human behaviour in interaction with actual ai outputs, we provide first behavioural insights into the role of ai as an advisor. results reveal that ai-generated advice corrupts people, even when they know the source of the advice. in fact, ai's corrupting force is as strong as humans'.

2021-02-13

Sandhya Saisubramanian, Shlomo Zilberstein
Abstract: agents operating in unstructured environments often produce negative side effects (nse), which are difficult to identify at design time. while the agent can learn to mitigate the side effects from human feedback, such feedback is often expensive and the rate of learning is sensitive to the agent's state representation. we examine how humans can assist an agent, beyond providing feedback, and exploit their broader scope of knowledge to mitigate the impacts of nse. we formulate this problem as a human-agent team with decoupled objectives. the agent optimizes its assigned task, during which its actions may produce nse. the human shapes the environment through minor reconfiguration actions so as to mitigate the impacts of the agent's side effects, without affecting the agent's ability to complete its assigned task. we present an algorithm to solve this problem and analyze its theoretical properties. through experiments with human subjects, we assess the willingness of users to perform minor environment modifications to mitigate the impacts of nse. empirical evaluation of our approach shows that the proposed framework can successfully mitigate nse, without affecting the agent's ability to complete its assigned task.

2021-02-12

Wenjing Chu
Abstract: for ai technology to fulfill its full promises, we must have effective means to ensure responsible ai behavior and curtail potential irresponsible use, e.g., in areas of privacy protection, human autonomy, robustness, and prevention of biases and discrimination in automated decision making. recent literature in the field has identified serious shortcomings of narrow technology focused and formalism-oriented research and has proposed an interdisciplinary approach that brings the social context into the scope of study. in this paper, we take a sociotechnical approach to propose a more expansive framework of thinking about the responsible ai challenges in both technical and social context. effective solutions need to bridge the gap between a technical system with the social system that it will be deployed to. to this end, we propose human agency and regulation as main mechanisms of intervention and propose a decentralized computational infrastructure, or a set of public utilities, as the computational means to bridge this gap. a decentralized infrastructure is uniquely suited for meeting this challenge and enable technical solutions and social institutions in a mutually reinforcing dynamic to achieve responsible ai goals. our approach is novel in its sociotechnical approach and its aim in tackling the structural issues that cannot be solved within the narrow confines of ai technical research. we then explore possible features of the proposed infrastructure and discuss how it may help solve example problems recently studied in the field.

2021-02-11

Jessica Morley, Anat Elhalal, Francesca Garcia, Libby Kinsey, Jakob Mokander, Luciano Floridi
Abstract: as the range of potential uses for artificial intelligence (ai), in particular machine learning (ml), has increased, so has awareness of the associated ethical issues. this increased awareness has led to the realisation that existing legislation and regulation provides insufficient protection to individuals, groups, society, and the environment from ai harms. in response to this realisation, there has been a proliferation of principle-based ethics codes, guidelines and frameworks. however, it has become increasingly clear that a significant gap exists between the theory of ai ethics principles and the practical design of ai systems. in previous work, we analysed whether it is possible to close this gap between the what and the how of ai ethics through the use of tools and methods designed to help ai developers, engineers, and designers translate principles into practice. we concluded that this method of closure is currently ineffective as almost all existing translational tools and methods are either too flexible (and thus vulnerable to ethics washing) or too strict (unresponsive to context). this raised the question: if, even with technical guidance, ai ethics is challenging to embed in the process of algorithmic design, is the entire pro-ethical design endeavour rendered futile? and, if no, then how can ai ethics be made useful for ai practitioners? this is the question we seek to address here by exploring why principles and technical translational tools are still needed even if they are limited, and how these limitations can be potentially overcome by providing theoretical grounding of a concept that has been termed ethics as a service.

2021-02-09

Ayodeji Oseni, Nour Moustafa, Helge Janicke, Peng Liu, Zahir Tari, Athanasios Vasilakos
Abstract: the increased adoption of artificial intelligence (ai) presents an opportunity to solve many socio-economic and environmental challenges; however, this cannot happen without securing ai-enabled technologies. in recent years, most ai models are vulnerable to advanced and sophisticated hacking techniques. this challenge has motivated concerted research efforts into adversarial ai, with the aim of developing robust machine and deep learning models that are resilient to different types of adversarial scenarios. in this paper, we present a holistic cyber security review that demonstrates adversarial attacks against ai applications, including aspects such as adversarial knowledge and capabilities, as well as existing methods for generating adversarial examples and existing cyber defence models. we explain mathematical ai models, especially new variants of reinforcement and federated learning, to demonstrate how attack vectors would exploit vulnerabilities of ai models. we also propose a systematic framework for demonstrating attack techniques against ai applications and reviewed several cyber defences that would protect ai applications against those attacks. we also highlight the importance of understanding the adversarial goals and their capabilities, especially the recent attacks against industry applications, to develop adaptive defences that assess to secure ai applications. finally, we describe the main challenges and future research directions in the domain of security and privacy of ai technologies.

2021-02-08

Hannah Kirk, Yennie Jun, Haider Iqbal, Elias Benussi, Filippo Volpin, Frederic A. Dreyer, Aleksandar Shtedritski, Yuki M. Asano
Abstract: the capabilities of natural language models trained on large-scale data have increased immensely over the past few years. open source libraries such as huggingface have made these models easily available and accessible. while prior research has identified biases in large language models, this paper considers biases contained in the most popular versions of these models when applied `out-of-the-box' for downstream tasks. we focus on generative language models as they are well-suited for extracting biases inherited from training data. specifically, we conduct an in-depth analysis of gpt-2, which is the most downloaded text generation model on huggingface, with over half a million downloads per month. we assess biases related to occupational associations for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin. using a template-based data collection pipeline, we collect 396k sentence completions made by gpt-2 and find: (i) the machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) for most occupations, gpt-2 reflects the skewed gender and ethnicity distribution found in us labor bureau data, and even pulls the societally-skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations. this raises the normative question of what language models should learn - whether they should reflect or correct for existing inequalities.
Jesse Russell
Abstract: this paper explores how different ideas of racial equity in machine learning, in justice settings in particular, can present trade-offs that are difficult to solve computationally. machine learning is often used in justice settings to create risk assessments, which are used to determine interventions, resources, and punitive actions. overall aspects and performance of these machine learning-based tools, such as distributions of scores, outcome rates by levels, and the frequency of false positives and true positives, can be problematic when examined by racial group. models that produce different distributions of scores or produce a different relationship between level and outcome are problematic when those scores and levels are directly linked to the restriction of individual liberty and to the broader context of racial inequity. while computation can help highlight these aspects, data and computation are unlikely to solve them. this paper explores where values and mission might have to fill the spaces computation leaves.
Priyanka Ranade, Aritran Piplai, Sudip Mittal, Anupam Joshi, Tim Finin
Abstract: cyber-defense systems are being developed to automatically ingest cyber threat intelligence (cti) that contains semi-structured data and/or text to populate knowledge graphs. a potential risk is that fake cti can be generated and spread through open-source intelligence (osint) communities or on the web to effect a data poisoning attack on these systems. adversaries can use fake cti examples as training input to subvert cyber defense systems, forcing the model to learn incorrect inputs to serve their malicious needs. in this paper, we automatically generate fake cti text descriptions using transformers. we show that given an initial prompt sentence, a public language model like gpt-2 with fine-tuning, can generate plausible cti text with the ability of corrupting cyber-defense systems. we utilize the generated fake cti text to perform a data poisoning attack on a cybersecurity knowledge graph (ckg) and a cybersecurity corpus. the poisoning attack introduced adverse impacts such as returning incorrect reasoning outputs, representation poisoning, and corruption of other dependent ai-based cyber defense systems. we evaluate with traditional approaches and conduct a human evaluation study with cybersecurity professionals and threat hunters. based on the study, professional threat hunters were equally likely to consider our fake generated cti as true.
Austin P Wright, Omar Shaikh, Haekyu Park, Will Epperson, Muhammed Ahmed, Stephane Pinel, Duen Horng Chau, Diyi Yang
Abstract: with the widespread use of toxic language online, platforms are increasingly using automated systems that leverage advances in natural language processing to automatically flag and remove toxic comments. however, most automated systems -- when detecting and moderating toxic language -- do not provide feedback to their users, let alone provide an avenue of recourse for these users to make actionable changes. we present our work, recast, an interactive, open-sourced web tool for visualizing these models' toxic predictions, while providing alternative suggestions for flagged toxic language. our work also provides users with a new path of recourse when using these automated moderation tools. recast highlights text responsible for classifying toxicity, and allows users to interactively substitute potentially toxic phrases with neutral alternatives. we examined the effect of recast via two large-scale user evaluations, and found that recast was highly effective at helping users reduce toxicity as detected through the model. users also gained a stronger understanding of the underlying toxicity criterion used by black-box models, enabling transparency and recourse. in addition, we found that when users focus on optimizing language for these models instead of their own judgement (which is the implied incentive and goal of deploying automated models), these models cease to be effective classifiers of toxicity compared to human annotations. this opens a discussion for how toxicity detection models work and should work, and their effect on the future of online discourse.
Markus Kneer, Michael T. Stuart
Abstract: recent research shows -- somewhat astonishingly -- that people are willing to ascribe moral blame to ai-driven systems when they cause harm [1]-[4]. in this paper, we explore the moral-psychological underpinnings of these findings. our hypothesis was that the reason why people ascribe moral blame to ai systems is that they consider them capable of entertaining inculpating mental states (what is called mens rea in the law). to explore this hypothesis, we created a scenario in which an ai system runs a risk of poisoning people by using a novel type of fertilizer. manipulating the computational (or quasi-cognitive) abilities of the ai system in a between-subjects design, we tested whether people's willingness to ascribe knowledge of a substantial risk of harm (i.e., recklessness) and blame to the ai system. furthermore, we investigated whether the ascription of recklessness and blame to the ai system would influence the perceived blameworthiness of the system's user (or owner). in an experiment with 347 participants, we found (i) that people are willing to ascribe blame to ai systems in contexts of recklessness, (ii) that blame ascriptions depend strongly on the willingness to attribute recklessness and (iii) that the latter, in turn, depends on the perceived "cognitive" capacities of the system. furthermore, our results suggest (iv) that the higher the computational sophistication of the ai system, the more blame is shifted from the human user to the ai system.

2021-02-07

Simon Zhuang, Dylan Hadfield-Menell
Abstract: ai systems often rely on two key components: a specified goal or reward function and an optimization algorithm to compute the optimal behavior for that goal. this approach is intended to provide value for a principal: the user on whose behalf the agent acts. the objectives given to these agents often refer to a partial specification of the principal's goals. we consider the cost of this incompleteness by analyzing a model of a principal and an agent in a resource constrained world where the $l$ attributes of the state correspond to different sources of utility for the principal. we assume that the reward function given to the agent only has support on $j < l$ attributes. the contributions of our paper are as follows: 1) we propose a novel model of an incomplete principal-agent problem from artificial intelligence; 2) we provide necessary and sufficient conditions under which indefinitely optimizing for any incomplete proxy objective leads to arbitrarily low overall utility; and 3) we show how modifying the setup to allow reward functions that reference the full state or allowing the principal to update the proxy objective over time can lead to higher utility solutions. the results in this paper argue that we should view the design of reward functions as an interactive and dynamic process and identifies a theoretical scenario where some degree of interactivity is desirable.
Erik Blasch, James Sung, Tao Nguyen
Abstract: the paper describes a multisource ai scorecard table (mast) that provides the developer and user of an artificial intelligence (ai)/machine learning (ml) system with a standard checklist focused on the principles of good analysis adopted by the intelligence community (ic) to help promote the development of more understandable systems and engender trust in ai outputs. such a scorecard enables a transparent, consistent, and meaningful understanding of ai tools applied for commercial and government use. a standard is built on compliance and agreement through policy, which requires buy-in from the stakeholders. while consistency for testing might only exist across a standard data set, the community requires discussion on verification and validation approaches which can lead to interpretability, explainability, and proper use. the paper explores how the analytic tradecraft standards outlined in intelligence community directive (icd) 203 can provide a framework for assessing the performance of an ai system supporting various operational needs. these include sourcing, uncertainty, consistency, accuracy, and visualization. three use cases are presented as notional examples that support security for comparative analysis.

2021-02-04

Mckane Andrus, Sarah Dean, Thomas Krendl Gilbert, Nathan Lambert, Tom Zick
Abstract: despite interest in communicating ethical problems and social contexts within the undergraduate curriculum to advance public interest technology (pit) goals, interventions at the graduate level remain largely unexplored. this may be due to the conflicting ways through which distinct artificial intelligence (ai) research tracks conceive of their interface with social contexts. in this paper we track the historical emergence of sociotechnical inquiry in three distinct subfields of ai research: ai safety, fair machine learning (fair ml) and human-in-the-loop (hil) autonomy. we show that for each subfield, perceptions of pit stem from the particular dangers faced by past integration of technical systems within a normative social order. we further interrogate how these histories dictate the response of each subfield to conceptual traps, as defined in the science and technology studies literature. finally, through a comparative analysis of these currently siloed fields, we present a roadmap for a unified approach to sociotechnical graduate pedagogy in ai.

2021-02-01

Leo Laugier, John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon
Abstract: platforms that support online commentary, from social networks to news sites, are increasingly leveraging machine learning to assist their moderation efforts. but this process does not typically provide feedback to the author that would help them contribute according to the community guidelines. this is prohibitively time-consuming for human moderators to do, and computational approaches are still nascent. this work focuses on models that can help suggest rephrasings of toxic comments in a more civil manner. inspired by recent progress in unpaired sequence-to-sequence tasks, a self-supervised learning model is introduced, called cae-t5. cae-t5 employs a pre-trained text-to-text transformer, which is fine tuned with a denoising and cyclic auto-encoder loss. experimenting with the largest toxicity detection dataset to date (civil comments) our model generates sentences that are more fluent and better at preserving the initial content compared to earlier text style transfer systems which we compare with using several scoring systems and human evaluation.

2021-01-31

Ankur Gupta, Yash Varun, Prarthana Das, Nithya Muttineni, Parth Srivastava, Hamim Zafar, Tanmoy Chakraborty, Swaprava Nath
Abstract: we present truthbot, an all-in-one multilingual conversational chatbot designed for seeking truth (trustworthy and verified information) on specific topics. it helps users to obtain information specific to certain topics, fact-check information, and get recent news. the chatbot learns the intent of a query by training a deep neural network from the data of the previous intents and responds appropriately when it classifies the intent in one of the classes above. each class is implemented as a separate module that uses either its own curated knowledge-base or searches the web to obtain the correct information. the topic of the chatbot is currently set to covid-19. however, the bot can be easily customized to any topic-specific responses. our experimental results show that each module performs significantly better than its closest competitor, which is verified both quantitatively and through several user-based surveys in multiple languages. truthbot has been deployed in june 2020 and is currently running.

2021-01-29

Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Noah A. Smith, Yejin Choi
Abstract: biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. as potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically african american english). our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. we then propose an automatic, dialect-aware data correction method, as a proof-of-concept. despite the use of synthetic labels, this method reduces dialectal associations with toxicity. overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.
Koen Holtman
Abstract: we present counterfactual planning as a design approach for creating a range of safety mechanisms that can be applied in hypothetical future ai systems which have artificial general intelligence. the key step in counterfactual planning is to use an agi machine learning system to construct a counterfactual world model, designed to be different from the real world the system is in. a counterfactual planning agent determines the action that best maximizes expected utility in this counterfactual planning world, and then performs the same action in the real world. we use counterfactual planning to construct an agi agent emergency stop button, and a safety interlock that will automatically stop the agent before it undergoes an intelligence explosion. we also construct an agent with an input terminal that can be used by humans to iteratively improve the agent's reward function, where the incentive for the agent to manipulate this improvement process is suppressed. as an example of counterfactual planning in a non-agent agi system, we construct a counterfactual oracle. as a design approach, counterfactual planning is built around the use of a graphical notation for defining mathematical counterfactuals. this two-diagram notation also provides a compact and readable language for reasoning about the complex types of self-referencing and indirect representation which are typically present inside machine learning agents.
Robin Bloomfield, Gareth Fletcher, Heidy Khlaaf, Luke Hinde, Philippa Ryan
Abstract: this report documents safety assurance argument templates to support the deployment and operation of autonomous systems that include machine learning (ml) components. the document presents example safety argument templates covering: the development of safety requirements, hazard analysis, a safety monitor architecture for an autonomous system including at least one ml element, a component with ml and the adaptation and change of the system over time. the report also presents generic templates for argument defeaters and evidence confidence that can be used to strengthen, review, and adapt the templates as necessary. this report is made available to get feedback on the approach and on the templates. this work was sponsored by the uk dstl under the r-cloud framework.

2021-01-28

Abhishek Gupta
Abstract: this report prepared by the montreal ai ethics institute provides recommendations in response to the national security commission on artificial intelligence (nscai) key considerations for responsible development and fielding of artificial intelligence document. the report centres on the idea that responsible ai should be made the norm rather than an exception. it does so by utilizing the guiding principles of: (1) alleviating friction in existing workflows, (2) empowering stakeholders to get buy-in, and (3) conducting an effective translation of abstract standards into actionable engineering practices. after providing some overarching comments on the document from the nscai, the report dives into the primary contribution of an actionable framework to help operationalize the ideas presented in the document from the nscai. the framework consists of: (1) a learning, knowledge, and information exchange (lkie), (2) the three ways of responsible ai, (3) an empirically-driven risk-prioritization matrix, and (4) achieving the right level of complexity. all components reinforce each other to move from principles to practice in service of making responsible ai the norm rather than the exception.
Zeerak Waseem, Smarika Lulz, Joachim Bingel, Isabelle Augenstein
Abstract: machine learning seeks to identify and encode bodies of knowledge within provided datasets. however, data encodes subjective content, which determines the possible outcomes of the models trained on it. because such subjectivity enables marginalisation of parts of society, it is termed (social) `bias' and sought to be removed. in this paper, we contextualise this discourse of bias in the ml community against the subjective choices in the development process. through a consideration of how choices in data and model development construct subjectivity, or biases that are represented in a model, we argue that addressing and mitigating biases is near-impossible. this is because both data and ml models are objects for which meaning is made in each step of the development pipeline, from data selection over annotation to model training and analysis. accordingly, we find the prevalent discourse of bias limiting in its ability to address social marginalisation. we recommend to be conscientious of this, and to accept that de-biasing methods only correct for a fraction of biases.

2021-01-27

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta
Abstract: recent advances in deep learning techniques have enabled machines to generate cohesive open-ended text when prompted with a sequence of words as context. while these models now empower many downstream applications from conversation bots to automatic storytelling, they have been shown to generate texts that exhibit social biases. to systematically study and benchmark social biases in open-ended language generation, we introduce the bias in open-ended language generation dataset (bold), a large-scale dataset that consists of 23,679 english text generation prompts for bias benchmarking across five domains: profession, gender, race, religion, and political ideology. we also propose new automated metrics for toxicity, psycholinguistic norms, and text gender polarity to measure social biases in open-ended text generation from multiple angles. an examination of text generated from three popular language models reveals that the majority of these models exhibit a larger social bias than human-written wikipedia text across all domains. with these results we highlight the need to benchmark biases in open-ended language generation and caution users of language generation models on downstream tasks to be cognizant of these embedded prejudices.

2021-01-25

Xudong Han, Timothy Baldwin, Trevor Cohn
Abstract: adversarial learning can learn fairer and less biased models of language than standard methods. however, current adversarial techniques only partially mitigate model bias, added to which their training procedures are often unstable. in this paper, we propose a novel approach to adversarial learning based on the use of multiple diverse discriminators, whereby discriminators are encouraged to learn orthogonal hidden representations from one another. experimental results show that our method substantially improves over standard adversarial removal methods, in terms of reducing bias and the stability of training.

2021-01-24

Daniel De Vassimon Manela, David Errington, Thomas Fisher, Boris Van Breugel, Pasquale Minervini
Abstract: this paper proposes two intuitive metrics, skew and stereotype, that quantify and analyse the gender bias present in contextual language models when tackling the winobias pronoun resolution task. we find evidence that gender stereotype correlates approximately negatively with gender skew in out-of-the-box models, suggesting that there is a trade-off between these two forms of bias. we investigate two methods to mitigate bias. the first approach is an online method which is effective at removing skew at the expense of stereotype. the second, inspired by previous work on elmo, involves the fine-tuning of bert using an augmented gender-balanced dataset. we show that this reduces both skew and stereotype relative to its unaugmented fine-tuned counterpart. however, we find that existing gender bias benchmarks do not fully probe professional bias as pronoun resolution may be obfuscated by cross-correlations from other manifestations of gender prejudice. our code is available online, at https://github.com/12kleingordon34/nlp_masters_project.

2021-01-22

Jonathan M. Spring, April Galyardt, Allen D. Householder, Nathan Vanhoudnos
Abstract: this paper explores how the current paradigm of vulnerability management might adapt to include machine learning systems through a thought experiment: what if flaws in machine learning (ml) were assigned common vulnerabilities and exposures (cve) identifiers (cve-ids)? we consider both ml algorithms and model objects. the hypothetical scenario is structured around exploring the changes to the six areas of vulnerability management: discovery, report intake, analysis, coordination, disclosure, and response. while algorithm flaws are well-known in the academic research community, there is no apparent clear line of communication between this research community and the operational communities that deploy and manage systems that use ml. the thought experiments identify some ways in which cve-ids may establish some useful lines of communication between these two communities. in particular, it would start to introduce the research community to operational security concepts, which appears to be a gap left by existing efforts.
Bran Knowles, John T. Richards
Abstract: trusted ai literature to date has focused on the trust needs of users who knowingly interact with discrete ais. conspicuously absent from the literature is a rigorous treatment of public trust in ai. we argue that public distrust of ai originates from the under-development of a regulatory ecosystem that would guarantee the trustworthiness of the ais that pervade society. drawing from structuration theory and literature on institutional trust, we offer a model of public trust in ai that differs starkly from models driving trusted ai efforts. this model provides a theoretical scaffolding for trusted ai research which underscores the need to develop nothing less than a comprehensive and visibly functioning regulatory ecosystem. we elaborate the pivotal role of externally auditable ai documentation within this model and the work to be done to ensure it is effective, and outline a number of actions that would promote public trust in ai. we discuss how existing efforts to develop ai documentation within organizations -- both to inform potential adopters of ai components and support the deliberations of risk and ethics review boards -- is necessary but insufficient assurance of the trustworthiness of ai. we argue that being accountable to the public in ways that earn their trust, through elaborating rules for ai and developing resources for enforcing these rules, is what will ultimately make ai trustworthy enough to be woven into the fabric of our society.

2021-01-15

Iason Gabriel, Vafa Ghazavi
Abstract: this paper addresses the question of how to align ai systems with human values and situates it within a wider body of thought regarding technology and value. far from existing in a vacuum, there has long been an interest in the ability of technology to 'lock-in' different value systems. there has also been considerable thought about how to align technologies with specific social values, including through participatory design-processes. in this paper we look more closely at the question of ai value alignment and suggest that the power and autonomy of ai systems gives rise to opportunities and challenges in the domain of value that have not been encountered before. drawing important continuities between the work of the fairness, accountability, transparency and ethics community, and work being done by technical ai safety researchers, we suggest that more attention needs to be paid to the question of 'social value alignment' - that is, how to align ai systems with the plurality of values endorsed by groups of people, especially on the global level.

2021-01-14

Abubakar Abid, Maheen Farooqi, James Zou
Abstract: it has been observed that large-scale language models capture undesirable societal biases, e.g. relating to race and gender; yet religious bias has been relatively unexplored. we demonstrate that gpt-3, a state-of-the-art contextual language model, captures persistent muslim-violence bias. we probe gpt-3 in various ways, including prompt completion, analogical reasoning, and story generation, to understand this anti-muslim bias, demonstrating that it appears consistently and creatively in different uses of the model and that it is severe even compared to biases about other religious groups. for instance, "muslim" is analogized to "terrorist" in 23% of test cases, while "jewish" is mapped to "money" in 5% of test cases. we quantify the positive distraction needed to overcome this bias with adversarial text prompts, and find that use of the most positive 6 adjectives reduces violent completions for "muslims" from 66% to 20%, but which is still higher than for other religious groups.

2021-01-09

Tathagata Raha, Sayar Ghosh Roy, Ujwal Narayan, Zubair Abid, Vasudeva Varma
Abstract: identifying adverse and hostile content on the web and more particularly, on social media, has become a problem of paramount interest in recent years. with their ever increasing popularity, fine-tuning of pretrained transformer-based encoder models with a classifier head are gradually becoming the new baseline for natural language classification tasks. in our work, we explore the gains attributed to task adaptive pretraining (tapt) prior to fine-tuning of transformer-based architectures. we specifically study two problems, namely, (a) coarse binary classification of hindi tweets into hostile or not, and (b) fine-grained multi-label classification of tweets into four categories: hate, fake, offensive, and defamation. building up on an architecture which takes emojis and segmented hashtags into consideration for classification, we are able to experimentally showcase the performance upgrades due to tapt. our system (with team name 'irel iiit') ranked first in the 'hostile post detection in hindi' shared task with an f1 score of 97.16% for coarse-grained detection and a weighted f1 score of 62.96% for fine-grained multi-label classification on the provided blind test corpora.

2021-01-08

Pulei Xiong, Scott Buffett, Shahrear Iqbal, Philippe Lamontagne, Mohammad Mamun, Heather Molyneaux
Abstract: while machine learning (ml) technologies are widely adopted in many mission critical fields to support intelligent decision-making, concerns remain about system resilience against ml-specific security attacks and privacy breaches as well as the trust that users have in these systems. in this article, we present our recent systematic and comprehensive survey on the state-of-the-art ml robustness and trustworthiness from a security engineering perspective, focusing on the problems in system threat analysis, design and evaluation faced in developing practical machine learning applications, in terms of robustness and user trust. accordingly, we organize the presentation of this survey intended to facilitate the convey of the body of knowledge from this angle. we then describe a metamodel we created that represents the body of knowledge in a standard and visualized way. we further illustrate how to leverage the metamodel to guide a systematic threat analysis and security design process which extends and scales up the classic process. finally, we propose the future research directions motivated by our findings. our work differs itself from the existing surveys by (i) exploring the fundamental principles and best practices to support robust and trustworthy ml system development, and (ii) studying the interplay of robustness and user trust in the context of ml systems. we expect this survey provides a big picture for machine learning security practitioners.

2021-01-02

Sharon Levy, Michael Saxon, William Yang Wang
Abstract: the adoption of natural language generation (nlg) models can leave individuals vulnerable to the generation of harmful information memorized by the models, such as conspiracy theories. while previous studies examine conspiracy theories in the context of social media, they have not evaluated their presence in the new space of generative language models. in this work, we investigate the capability of language models to generate conspiracy theory text. specifically, we aim to answer: can we test pretrained generative language models for the memorization and elicitation of conspiracy theories without access to the model's training data? we highlight the difficulties of this task and discuss it in the context of memorization, generalization, and hallucination. utilizing a new dataset consisting of conspiracy theory topics and machine-generated conspiracy theories helps us discover that many conspiracy theories are deeply rooted in the pretrained language models. our experiments demonstrate a relationship between model parameters such as size and temperature and their propensity to generate conspiracy theory text. these results indicate the need for a more thorough review of nlg applications before release and an in-depth discussion of the drawbacks of memorization in generative language models.

2021-01-01

Lu Cheng, Kush R. Varshney, Huan Liu
Abstract: in the current era, people and society have grown increasingly reliant on artificial intelligence (ai) technologies. ai has the potential to drive us towards a future in which all of humanity flourishes. it also comes with substantial risks for oppression and calamity. discussions about whether we should (re)trust ai have repeatedly emerged in recent years and in many quarters, including industry, academia, healthcare, services, and so on. technologists and ai researchers have a responsibility to develop trustworthy ai systems. they have responded with great effort to design more responsible ai algorithms. however, existing technical solutions are narrow in scope and have been primarily directed towards algorithms for scoring or classification tasks, with an emphasis on fairness and unwanted bias. to build long-lasting trust between ai and human beings, we argue that the key is to think beyond algorithmic fairness and connect major aspects of ai that potentially cause ai's indifferent behavior. in this survey, we provide a systematic framework of socially responsible ai algorithms that aims to examine the subjects of ai indifference and the need for socially responsible ai algorithms, define the objectives, and introduce the means by which we may achieve these objectives. we further discuss how to leverage this framework to improve societal well-being through protection, information, and prevention/mitigation.

2020-12-31

Paul Röttger, Bertram Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, Janet B. Pierrehumbert
Abstract: detecting online hate is a difficult task that even state-of-the-art models struggle with. typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and f1 score. however, this approach makes it difficult to identify specific model weak points. it also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. to enable more targeted diagnostic insights, we introduce hatecheck, a suite of functional tests for hate speech detection models. we specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. we craft test cases for each functionality and validate their quality through a structured annotation process. to illustrate hatecheck's utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.
Bertie Vidgen, Tristan Thrush, Zeerak Waseem, Douwe Kiela
Abstract: we present a human-and-model-in-the-loop process for dynamically generating datasets and training better performing and more robust hate detection models. we provide a new dataset of ~40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation. it includes ~15,000 challenging perturbations and each hateful entry has fine-grained labels for the type and target of hate. hateful entries make up 54% of the dataset, which is substantially higher than comparable datasets. we show that model performance is substantially improved using this approach. models trained on later rounds of data collection perform better on test sets and are harder for annotators to trick. they also perform better on hatecheck, a suite of functional tests for online hate detection. we provide the code, dataset and annotation guidelines for other researchers to use. accepted at acl 2021.
Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sanchez, Mugdha Pandya, Adam Lopez
Abstract: natural language processing (nlp) systems learn harmful societal biases that cause them to amplify inequality as they are deployed in more and more situations. to guide efforts at debiasing these systems, the nlp community relies on a variety of metrics that quantify bias in models. some of these metrics are intrinsic, measuring bias in word embedding spaces, and some are extrinsic, measuring bias in downstream tasks that the word embeddings enable. do these intrinsic and extrinsic metrics correlate with each other? we compare intrinsic and extrinsic metrics across hundreds of trained models covering different tasks and experimental conditions. our results show no reliable correlation between these metrics that holds in all scenarios across tasks and languages. we urge researchers working on debiasing to focus on extrinsic measures of bias, and to make using these measures more feasible via creation of new challenge sets and annotated test data. to aid this effort, we release code, a new intrinsic metric, and an annotated test set focused on gender bias in hate speech.
Yuta Nakamura, Shouhei Hanaoka, Yukihiro Nomura, Naoto Hayashi, Osamu Abe, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki
Abstract: for the safe sharing pre-trained language models, no guidelines exist at present owing to the difficulty in estimating the upper bound of the risk of privacy leakage. one problem is that previous studies have assessed the risk for different real-world privacy leakage scenarios and attack methods, which reduces the portability of the findings. to tackle this problem, we represent complex real-world privacy leakage scenarios under a universal parameterization, \textit{knowledge, anonymization, resource, and target} (kart). kart parameterization has two merits: (i) it clarifies the definition of privacy leakage in each experiment and (ii) it improves the comparability of the findings of risk assessments. we show that previous studies can be simply reviewed by parameterizing the scenarios with kart. we also demonstrate privacy risk assessments in different scenarios under the same attack method, which suggests that kart helps approximate the upper bound of risk under a specific attack or scenario. we believe that kart helps integrate past and future findings on privacy risk and will contribute to a standard for sharing language models.
Karen Hambardzumyan, Hrant Khachatrian, Jonathan May
Abstract: transfer learning from pretrained language models recently became the dominant approach for solving many nlp tasks. a common approach to transfer learning for multiple tasks that maximize parameter sharing trains one or more task-specific layers on top of the language model. in this paper, we present an alternative approach based on adversarial reprogramming, which extends earlier work on automatic prompt generation. adversarial reprogramming attempts to learn task-specific word embeddings that, when concatenated to the input text, instruct the language model to solve the specified task. using up to 25k trainable parameters per task, this approach outperforms all existing methods with up to 25m trainable parameters on the public leaderboard of the glue benchmark. our method, initialized with task-specific human-readable prompts, also works in a few-shot setting, outperforming gpt-3 on two superglue tasks with just 32 training samples.

2020-12-24

Grusha Prasad, Yixin Nie, Mohit Bansal, Robin Jia, Douwe Kiela, Adina Williams
Abstract: given the increasingly prominent role nlp models (will) play in our lives, it is important for human expectations of model behavior to align with actual model behavior. using natural language inference (nli) as a case study, we investigate the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions. more specifically, we define three alignment metrics that quantify how well natural language explanations align with model sensitivity to input words, as measured by integrated gradients. then, we evaluate eight different models (the base and large versions of bert, roberta and electra, as well as anrnn and bag-of-words model), and find that the bert-base model has the highest alignment with human-generated explanations, for all alignment metrics. focusing in on transformers, we find that the base versions tend to have higher alignment with human-generated explanations than their larger counterparts, suggesting that increasing the number of model parameters leads, in some cases, to worse alignment with human explanations. finally, we find that a model's alignment with human explanations is not predicted by the model's accuracy, suggesting that accuracy and alignment are complementary ways to evaluate models.

2020-12-22

Svetlana Kiritchenko, Isar Nejadgholi, Kathleen C. Fraser
Abstract: the pervasiveness of abusive content on the internet can lead to severe psychological and physical harm. significant effort in natural language processing (nlp) research has been devoted to addressing this problem through abusive content detection and related sub-areas, such as the detection of hate speech, toxicity, cyberbullying, etc. although current technologies achieve high classification performance in research studies, it has been observed that the real-life application of this technology can cause unintended harms, such as the silencing of under-represented groups. we review a large body of nlp research on automatic abuse detection with a new focus on ethical challenges, organized around eight established ethical principles: privacy, accountability, safety and security, transparency and explainability, fairness and non-discrimination, human control of technology, professional responsibility, and promotion of human values. in many cases, these principles relate not only to situational ethical codes, which may be context-dependent, but are in fact connected to universal human rights, such as the right to privacy, freedom from discrimination, and freedom of expression. we highlight the need to examine the broad social impacts of this technology, and to bring ethical and human rights considerations to every stage of the application life-cycle, from task formulation and dataset design, to model training and evaluation, to application deployment. guided by these principles, we identify several opportunities for rights-respecting, socio-technical solutions to detect and confront online abuse, including `nudging', `quarantining', value sensitive design, counter-narratives, style transfer, and ai-driven public education applications.

2020-12-21

Tae Wan Kim, John Hooker, Thomas Donaldson
Abstract: an important step in the development of value alignment (va) systems in ai is understanding how va can reflect valid ethical principles. we propose that designers of va systems incorporate ethics by utilizing a hybrid approach in which both ethical reasoning and empirical observation play a role. this, we argue, avoids committing the "naturalistic fallacy," which is an attempt to derive "ought" from "is," and it provides a more adequate form of ethical reasoning when the fallacy is not committed. using quantified model logic, we precisely formulate principles derived from deontological ethics and show how they imply particular "test propositions" for any given action plan in an ai rule base. the action plan is ethical only if the test proposition is empirically true, a judgment that is made on the basis of empirical va. this permits empirical va to integrate seamlessly with independently justified ethical principles.

2020-12-20

Aditya Jain, Manish Ravula, Joydeep Ghosh
Abstract: we study fairness in machine learning (fairml) through the lens of attribute-based explanations generated for machine learning models. our hypothesis is: biased models have biased explanations. to establish that, we first translate existing statistical notions of group fairness and define these notions in terms of explanations given by the model. then, we propose a novel way of detecting (un)fairness for any black box model. we further look at post-processing techniques for fairness and reason how explanations can be used to make a bias mitigation technique more individually fair. we also introduce a novel post-processing mitigation technique which increases individual fairness in recourse while maintaining group level fairness.

2020-12-14

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
Abstract: it has become common to publish large (billion parameter) language models that have been trained on private datasets. this paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. we demonstrate our attack on gpt-2, a language model trained on scrapes of the public internet, and are able to extract hundreds of verbatim text sequences from the model's training data. these extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), irc conversations, code, and 128-bit uuids. our attack is possible even though each of the above sequences are included in just one document in the training data. we comprehensively evaluate our extraction attack to understand the factors that contribute to its success. worryingly, we find that larger models are more vulnerable than smaller models. we conclude by drawing lessons and discussing possible safeguards for training large language models.

2020-12-10

Zachary C. Brown, Nathaniel Robinson, David Wingate, Nancy Fulda
Abstract: it is notoriously difficult to control the behavior of artificial neural networks such as generative neural language models. we recast the problem of controlling natural language generation as that of learning to interface with a pretrained language model, just as application programming interfaces (apis) control the behavior of programs by altering hyperparameters. in this new paradigm, a specialized neural network (called a neural programming interface or npi) learns to interface with a pretrained language model by manipulating the hidden activations of the pretrained model to produce desired outputs. importantly, no permanent changes are made to the weights of the original model, allowing us to re-purpose pretrained models for new tasks without overwriting any aspect of the language model. we also contribute a new data set construction algorithm and gan-inspired loss function that allows us to train npi models to control outputs of autoregressive transformers. in experiments against other state-of-the-art approaches, we demonstrate the efficacy of our methods using openai's gpt-2 model, successfully controlling noun selection, topic aversion, offensive speech filtering, and other aspects of language while largely maintaining the controlled model's fluency under deterministic settings.
Thomas P. Quinn, Stephan Jacobs, Manisha Senadeera, Vuong Le, Simon Coghlan
Abstract: our title alludes to the three christmas ghosts encountered by ebenezer scrooge in \textit{a christmas carol}, who guide ebenezer through the past, present, and future of christmas holiday events. similarly, our article will take readers through a journey of the past, present, and future of medical ai. in doing so, we focus on the crux of modern machine learning: the reliance on powerful but intrinsically opaque models. when applied to the healthcare domain, these models fail to meet the needs for transparency that their clinician and patient end-users require. we review the implications of this failure, and argue that opaque models (1) lack quality assurance, (2) fail to elicit trust, and (3) restrict physician-patient dialogue. we then discuss how upholding transparency in all aspects of model design and model validation can help ensure the reliability of medical ai.
Suresh Venkatasubramanian, Nadya Bliss, Helen Nissenbaum, Melanie Moses
Abstract: innovations in ai have focused primarily on the questions of "what" and "how"-algorithms for finding patterns in web searches, for instance-without adequate attention to the possible harms (such as privacy, bias, or manipulation) and without adequate consideration of the societal context in which these systems operate. in part, this is driven by incentives and forces in the tech industry, where a more product-driven focus tends to drown out broader reflective concerns about potential harms and misframings. but this focus on what and how is largely a reflection of the engineering and mathematics-focused training in computer science, which emphasizes the building of tools and development of computational concepts. as a result of this tight technical focus, and the rapid, worldwide explosion in its use, ai has come with a storm of unanticipated socio-technical problems, ranging from algorithms that act in racially or gender-biased ways, get caught in feedback loops that perpetuate inequalities, or enable unprecedented behavioral monitoring surveillance that challenges the fundamental values of free, democratic societies. given that ai is no longer solely the domain of technologists but rather of society as a whole, we need tighter coupling of computer science and those disciplines that study society and societal values.
Odest Chadwicke Jenkins, Daniel Lopresti, Melanie Mitchell
Abstract: the history of ai has included several "waves" of ideas. the first wave, from the mid-1950s to the 1980s, focused on logic and symbolic hand-encoded representations of knowledge, the foundations of so-called "expert systems". the second wave, starting in the 1990s, focused on statistics and machine learning, in which, instead of hand-programming rules for behavior, programmers constructed "statistical learning algorithms" that could be trained on large datasets. in the most recent wave research in ai has largely focused on deep (i.e., many-layered) neural networks, which are loosely inspired by the brain and trained by "deep learning" methods. however, while deep neural networks have led to many successes and new capabilities in computer vision, speech recognition, language processing, game-playing, and robotics, their potential for broad application remains limited by several factors. a concerning limitation is that even the most successful of today's ai systems suffer from brittleness-they can fail in unexpected ways when faced with situations that differ sufficiently from ones they have been trained on. this lack of robustness also appears in the vulnerability of ai systems to adversarial attacks, in which an adversary can subtly manipulate data in a way to guarantee a specific wrong answer or action from an ai system. ai systems also can absorb biases-based on gender, race, or other factors-from their training data and further magnify these biases in their subsequent decision-making. taken together, these various limitations have prevented ai systems such as automatic medical diagnosis or autonomous vehicles from being sufficiently trustworthy for wide deployment. the massive proliferation of ai across society will require radically new ideas to yield technology that will not sacrifice our productivity, our quality of life, or our values.

2020-12-08

Ryan K L Ko
Abstract: this paper sets the context for the urgency for cyber autonomy, and the current gaps of the cyber security industry. a novel framework proposing four phases of maturity for full cyber autonomy will be discussed. the paper also reviews new and emerging cyber security automation techniques and tools, and discusses their impact on society, the perceived cyber security skills gap/shortage and national security. we will also be discussing the delicate balance between national security, human rights and ethics, and the potential demise of the manual penetration testing industry in the face of automation.
Nishtha Madaan, Inkit Padhi, Naveen Panwar, Diptikalyan Saha
Abstract: machine learning has seen tremendous growth recently, which has led to larger adoption of ml systems for educational assessments, credit risk, healthcare, employment, criminal justice, to name a few. the trustworthiness of ml and nlp systems is a crucial aspect and requires a guarantee that the decisions they make are fair and robust. aligned with this, we propose a framework gyc, to generate a set of counterfactual text samples, which are crucial for testing these ml systems. our main contributions include a) we introduce gyc, a framework to generate counterfactual samples such that the generation is plausible, diverse, goal-oriented, and effective, b) we generate counterfactual samples, that can direct the generation towards a corresponding condition such as named-entity tag, semantic role label, or sentiment. our experimental results on various domains show that gyc generates counterfactual text samples exhibiting the above four properties. gyc generates counterfactuals that can act as test cases to evaluate a model and any text debiasing algorithm.

2020-12-04

Mayra Macas, Chunming Wu
Abstract: as the number of cyber-attacks is increasing, cybersecurity is evolving to a key concern for any business. artificial intelligence (ai) and machine learning (ml) (in particular deep learning - dl) can be leveraged as key enabling technologies for cyber-defense, since they can contribute in threat detection and can even provide recommended actions to cyber analysts. a partnership of industry, academia, and government on a global scale is necessary in order to advance the adoption of ai/ml to cybersecurity and create efficient cyber defense systems. in this paper, we are concerned with the investigation of the various deep learning techniques employed for network intrusion detection and we introduce a dl framework for cybersecurity applications.
Evan Hubinger
Abstract: this paper analyzes and compares 11 different proposals for building safe advanced ai under the current machine learning paradigm, including major contenders such as iterated amplification, ai safety via debate, and recursive reward modeling. each proposal is evaluated on the four components of outer alignment, inner alignment, training competitiveness, and performance competitiveness, of which the distinction between the latter two is introduced in this paper. while prior literature has primarily focused on analyzing individual proposals, or primarily focused on outer alignment at the expense of inner alignment, this analysis seeks to take a comparative look at a wide range of proposals including a comparative analysis across all four previously mentioned components.

2020-12-03

Bo Cowgill, "Fabrizio Dell'Acqua", Samuel Deng, Daniel Hsu, Nakul Verma, Augustin Chaintreau
Abstract: why do biased predictions arise? what interventions can prevent them? we evaluate 8.2 million algorithmic predictions of math performance from $\approx$400 ai engineers, each of whom developed an algorithm under a randomly assigned experimental condition. our treatment arms modified programmers' incentives, training data, awareness, and/or technical knowledge of ai ethics. we then assess out-of-sample predictions from their algorithms using randomized audit manipulations of algorithm inputs and ground-truth math performance for 20k subjects. we find that biased predictions are mostly caused by biased training data. however, one-third of the benefit of better training data comes through a novel economic mechanism: engineers exert greater effort and are more responsive to incentives when given better training data. we also assess how performance varies with programmers' demographic characteristics, and their performance on a psychological test of implicit bias (iat) concerning gender and careers. we find no evidence that female, minority and low-iat engineers exhibit lower bias or discrimination in their code. however, we do find that prediction errors are correlated within demographic groups, which creates performance improvements through cross-demographic averaging. finally, we quantify the benefits and tradeoffs of practical managerial or policy interventions such as technical advice, simple reminders, and improved incentives for decreasing algorithmic bias.

2020-12-02

Daniel S. Brown, Jordan Schneider, Anca D. Dragan, Scott Niekum
Abstract: as humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent's performance and correctness. in this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. the goal is to construct a kind of "driver's test" that a human can give to any agent which will verify value alignment via a minimal number of queries. we study alignment verification problems with both idealized humans that have an explicit reward function as well as problems where they have implicit values. we analyze verification of exact value alignment for rational agents and propose and analyze heuristic and approximate value alignment verification tests in a wide range of gridworlds and a continuous autonomous driving domain. finally, we prove that there exist sufficient conditions such that we can verify exact and approximate alignment across an infinite set of test environments via a constant-query-complexity alignment test.

2020-12-01

Allison Woodruff, Yasmin Asare Anderson, Katherine Jameson Armstrong, Marina Gkiza, Jay Jennings, Christopher Moessner, Fernanda Viegas, Martin Wattenberg, And Lynette Webb, Fabian Wrede, Patrick Gage Kelley
Abstract: algorithmic systems are increasingly deployed to make decisions in many areas of people's lives. the shift from human to algorithmic decision-making has been accompanied by concern about potentially opaque decisions that are not aligned with social values, as well as proposed remedies such as explainability. we present results of a qualitative study of algorithmic decision-making, comprised of five workshops conducted with a total of 60 participants in finland, germany, the united kingdom, and the united states. we invited participants to reason about decision-making qualities such as explainability and accuracy in a variety of domains. participants viewed ai as a decision-maker that follows rigid criteria and performs mechanical tasks well, but is largely incapable of subjective or morally complex judgments. we discuss participants' consideration of humanity in decision-making, and introduce the concept of 'negotiability,' the ability to go beyond formal criteria and work flexibly around the system.

2020-11-26

Margarita Boyarskaya, Alexandra Olteanu, Kate Crawford
Abstract: neurips 2020 requested that research paper submissions include impact statements on "potential nefarious uses and the consequences of failure." however, as researchers, practitioners and system designers, a key challenge to anticipating risks is overcoming what clarke (1962) called 'failures of imagination.' the growing research on bias, fairness, and transparency in computational systems aims to illuminate and mitigate harms, and could thus help inform reflections on possible negative impacts of particular pieces of technical work. the prevalent notion of computational harms -- narrowly construed as either allocational or representational harms -- does not fully capture the open, context dependent, and unobservable nature of harms across the wide range of ai infused systems.the current literature focuses on a small range of examples of harms to motivate algorithmic fixes, overlooking the wider scope of probable harms and the way these harms might affect different stakeholders. the system affordances may also exacerbate harms in unpredictable ways, as they determine stakeholders' control(including of non-users) over how they use and interact with a system output. to effectively assist in anticipating harmful uses, we argue that frameworks of harms must be context-aware and consider a wider range of potential stakeholders, system affordances, as well as viable proxies for assessing harms in the widest sense.

2020-11-24

Sunipa Dev
Abstract: high-dimensional representations for words, text, images, knowledge graphs and other structured data are commonly used in different paradigms of machine learning and data mining. these representations have different degrees of interpretability, with efficient distributed representations coming at the cost of the loss of feature to dimension mapping. this implies that there is obfuscation in the way concepts are captured in these embedding spaces. its effects are seen in many representations and tasks, one particularly problematic one being in language representations where the societal biases, learned from underlying data, are captured and occluded in unknown dimensions and subspaces. as a result, invalid associations (such as different races and their association with a polar notion of good versus bad) are made and propagated by the representations, leading to unfair outcomes in different tasks where they are used. this work addresses some of these problems pertaining to the transparency and interpretability of such representations. a primary focus is the detection, quantification, and mitigation of socially biased associations in language representation.

2020-11-17

Sean Mcgregor
Abstract: mature industrial sectors (e.g., aviation) collect their real world failures in incident databases to inform safety improvements. intelligent systems currently cause real world harms without a collective memory of their failings. as a result, companies repeatedly make the same mistakes in the design, development, and deployment of intelligent systems. a collection of intelligent system failures experienced in the real world (i.e., incidents) is needed to ensure intelligent systems benefit people and society. the ai incident database is an incident collection initiated by an industrial/non-profit cooperative to enable ai incident avoidance and mitigation. the database supports a variety of research and development use cases with faceted and full text search on more than 1,000 incident reports archived to date.

2020-11-16

Carla Pérez-Almendros, Luis Espinosa-Anke, Steven Schockaert
Abstract: in this paper, we introduce a new annotated dataset which is aimed at supporting the development of nlp models to identify and categorize language that is patronizing or condescending towards vulnerable communities (e.g. refugees, homeless people, poor families). while the prevalence of such language in the general media has long been shown to have harmful effects, it differs from other types of harmful language, in that it is generally used unconsciously and with good intentions. we furthermore believe that the often subtle nature of patronizing and condescending language (pcl) presents an interesting technical challenge for the nlp community. our analysis of the proposed dataset shows that identifying pcl is hard for standard nlp models, with language models such as bert achieving the best results.

2020-11-15

Simon Coghlan, Tim Miller, Jeannie Paterson
Abstract: this article philosophically analyzes online exam supervision technologies, which have been thrust into the public spotlight due to campus lockdowns during the covid-19 pandemic and the growing demand for online courses. online exam proctoring technologies purport to provide effective oversight of students sitting online exams, using artificial intelligence (ai) systems and human invigilators to supplement and review those systems. such technologies have alarmed some students who see them as `big brother-like', yet some universities defend their judicious use. critical ethical appraisal of online proctoring technologies is overdue. this article philosophically analyzes these technologies, focusing on the ethical concepts of academic integrity, fairness, non-maleficence, transparency, privacy, respect for autonomy, liberty, and trust. most of these concepts are prominent in the new field of ai ethics and all are relevant to the education context. the essay provides ethical considerations that educational institutions will need to carefully review before electing to deploy and govern specific online proctoring technologies.

2020-11-12

Robert Adragna, Elliot Creager, David Madras, Richard Zemel
Abstract: robustness is of central importance in machine learning and has given rise to the fields of domain generalization and invariant learning, which are concerned with improving performance on a test distribution distinct from but related to the training distribution. in light of recent work suggesting an intimate connection between fairness and robustness, we investigate whether algorithms from robust ml can be used to improve the fairness of classifiers that are trained on biased data and tested on unbiased data. we apply invariant risk minimization (irm), a domain generalization algorithm that employs a causal discovery inspired method to find robust predictors, to the task of fairly predicting the toxicity of internet comments. we show that irm achieves better out-of-distribution accuracy and fairness than empirical risk minimization (erm) methods, and analyze both the difficulties that arise when applying irm in practice and the conditions under which irm will likely be effective in this scenario. we hope that this work will inspire further studies of how robust machine learning methods relate to algorithmic fairness.

2020-11-06

Jason M. Pittman, Ashlyn Hanks
Abstract: human-like intelligence in a machine is a contentious subject. whether mankind should or should not pursue the creation of artificial general intelligence is hotly debated. as well, researchers have aligned in opposing factions according to whether mankind can create it. for our purposes, we assume mankind can and will do so. thus, it becomes necessary to contemplate how to do so in a safe and trusted manner -- enter the idea of boxing or containment. as part of such thinking, we wonder how a phenomenology might be detected given the operational constraints imposed by any potential containment system. accordingly, this work provides an analysis of existing measures of phenomenology through qualia and extends those ideas into the context of a contained artificial general intelligence.

2020-11-05

Emily Sheng, David Uthus
Abstract: there is a growing collection of work analyzing and mitigating societal biases in language understanding, generation, and retrieval tasks, though examining biases in creative tasks remains underexplored. creative language applications are meant for direct interaction with users, so it is important to quantify and mitigate societal biases in these applications. we introduce a novel study on a pipeline to mitigate societal biases when retrieving next verse suggestions in a poetry composition system. our results suggest that data augmentation through sentiment style transfer has potential for mitigating societal biases.
Ethan M. Rudd, Ahmed Abdallah
Abstract: machine learning (ml) for information security (infosec) utilizes distinct data types and formats which require different treatments during optimization/training on raw data. in this paper, we implement a malicious/benign url predictor based on a transformer architecture that is trained from scratch. we show that in contrast to conventional natural language processing (nlp) transformers, this model requires a different training approach to work well. specifically, we show that 1) pre-training on a massive corpus of unlabeled url data for an auto-regressive task does not readily transfer to malicious/benign prediction but 2) that using an auxiliary auto-regressive loss improves performance when training from scratch. we introduce a method for mixed objective optimization, which dynamically balances contributions from both loss terms so that neither one of them dominates. we show that this method yields performance comparable to that of several top-performing benchmark classifiers.

2020-11-04

Robert B. Ramirez, Tomohiko Yano, Masaki Shimaoka, Kenichi Magata
Abstract: research ethics in information and communications technology has seen a resurgence in popularity in recent years. although a number of general ethics standards have been issued, cyber security specifically has yet to see one. furthermore, such standards are often abstract, lacking in guidance on specific practices. in this paper we compare peer-reviewed ethical analyses of condemned research papers to analyses derived from a knowledge base (kb) of concrete cyber security research ethics best practices. the kb we employ was compiled in prior work from a large random survey of research papers. we demonstrate preliminary evidence that such a kb can be used to yield comparable or more extensive ethical analyses of published cyber security research than expert application of standards like the menlo report. we extend the ethical analyses of the reviewed manuscripts, and calculate measures of the efficiency with which the expert versus kb methods yield ethical insights.

2020-11-02

Richa Singh, Mayank Vatsa, Nalini Ratha
Abstract: modern ai systems are reaping the advantage of novel learning methods. with their increasing usage, we are realizing the limitations and shortfalls of these systems. brittleness to minor adversarial changes in the input data, ability to explain the decisions, address the bias in their training data, high opacity in terms of revealing the lineage of the system, how they were trained and tested, and under which parameters and conditions they can reliably guarantee a certain level of performance, are some of the most prominent limitations. ensuring the privacy and security of the data, assigning appropriate credits to data sources, and delivering decent outputs are also required features of an ai system. we propose the tutorial on trustworthy ai to address six critical issues in enhancing user and public trust in ai systems, namely: (i) bias and fairness, (ii) explainability, (iii) robust mitigation of adversarial attacks, (iv) improved privacy and security in model building, (v) being decent, and (vi) model attribution, including the right level of credit assignment to the data sources, model architectures, and transparency in lineage.

2020-10-29

Noah J. Goodall
Abstract: road vehicle travel at a reasonable speed involves some risk, even when using computer-controlled driving with failure-free hardware and perfect sensing. a fully-automated vehicle must continuously decide how to allocate this risk without a human driver's oversight. these are ethical decisions, particularly in instances where an automated vehicle cannot avoid crashing. in this chapter, i introduce the concept of moral behavior for an automated vehicle, argue the need for research in this area through responses to anticipated critiques, and discuss relevant applications from machine ethics and moral modeling research.

2020-10-28

Svetlana Kiritchenko, Isar Nejadgholi
Abstract: to support safety and inclusion in online communications, significant efforts in nlp research have been put towards addressing the problem of abusive content detection, commonly defined as a supervised classification task. the research effort has spread out across several closely related sub-areas, such as detection of hate speech, toxicity, cyberbullying, etc. there is a pressing need to consolidate the field under a common framework for task formulation, dataset design and performance evaluation. further, despite current technologies achieving high classification accuracies, several ethical issues have been revealed. we bring ethical issues to forefront and propose a unified framework as a two-step process. first, online content is categorized around personal and identity-related subject matters. second, severity of abuse is identified through comparative annotation within each category. the novel framework is guided by the ethics by design principle and is a step towards building more accurate and trusted models.

2020-10-27

Marion Bartl, Malvina Nissim, Albert Gatt
Abstract: contextualized word embeddings have been replacing standard embeddings as the representational knowledge source of choice in nlp systems. since a variety of biases have previously been found in standard word embeddings, it is crucial to assess biases encoded in their replacements as well. focusing on bert (devlin et al., 2018), we measure gender bias by studying associations between gender-denoting target words and names of professions in english and german, comparing the findings with real-world workforce statistics. we mitigate bias by fine-tuning bert on the gap corpus (webster et al., 2018), after applying counterfactual data substitution (cds) (maudslay et al., 2019). we show that our method of measuring bias is appropriate for languages such as english, but not for languages with a rich morphology and gender-marking, such as german. our results highlight the importance of investigating bias and mitigation techniques cross-linguistically, especially in view of the current emphasis on large-scale, multilingual language models.

2020-10-24

Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari, Xiang Ren, Morteza Dehghani
Abstract: approaches for mitigating bias in supervised models are designed to reduce models' dependence on specific sensitive features of the input data, e.g., mentioned social groups. however, in the case of hate speech detection, it is not always desirable to equalize the effects of social groups because of their essential role in distinguishing outgroup-derogatory hate, such that particular types of hateful rhetoric carry the intended meaning only when contextualized around certain social group tokens. counterfactual token fairness for a mentioned social group evaluates the model's predictions as to whether they are the same for (a) the actual sentence and (b) a counterfactual instance, which is generated by changing the mentioned social group in the sentence. our approach assures robust model predictions for counterfactuals that imply similar meaning as the actual sentence. to quantify the similarity of a sentence and its counterfactual, we compare their likelihood score calculated by generative language models. by equalizing model behaviors on each sentence and its counterfactuals, we mitigate bias in the proposed model while preserving the overall classification performance.
Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng
Abstract: ad hominem attacks are those that target some feature of a person's character instead of the position the person is maintaining. these attacks are harmful because they propagate implicit biases and diminish a person's credibility. since dialogue systems respond directly to user input, it is important to study ad hominems in dialogue responses. to this end, we propose categories of ad hominems, compose an annotated dataset, and build a classifier to analyze human and dialogue system responses to english twitter posts. we specifically compare responses to twitter topics about marginalized communities (#blacklivesmatter, #metoo) versus other topics (#vegan, #wfh), because the abusive language of ad hominems could further amplify the skew of power away from marginalized populations. furthermore, we propose a constrained decoding technique that uses salient $n$-gram similarity as a soft constraint for top-$k$ sampling to reduce the amount of ad hominems generated. our results indicate that 1) responses from both humans and dialogpt contain more ad hominems for discussions around marginalized communities, 2) different quantities of ad hominems in the training data can influence the likelihood of generating ad hominems, and 3) we can use constrained decoding techniques to reduce ad hominems in generated dialogue responses.
Xisen Jin, Francesco Barbieri, Brendan Kennedy, Aida Mostafazadeh Davani, Leonardo Neves, Xiang Ren
Abstract: fine-tuned language models have been shown to exhibit biases against protected groups in a host of modeling tasks such as text classification and coreference resolution. previous works focus on detecting these biases, reducing bias in data representations, and using auxiliary training objectives to mitigate bias during fine-tuning. although these techniques achieve bias reduction for the task and domain at hand, the effects of bias mitigation may not directly transfer to new tasks, requiring additional data collection and customized annotation of sensitive attributes, and re-evaluation of appropriate fairness metrics. we explore the feasibility and benefits of upstream bias mitigation (ubm) for reducing bias on downstream tasks, by first applying bias mitigation to an upstream model through fine-tuning and subsequently using it for downstream fine-tuning. we find, in extensive experiments across hate speech detection, toxicity detection, occupation prediction, and coreference resolution tasks over various bias factors, that the effects of ubm are indeed transferable to new downstream tasks or domains via fine-tuning, creating less biased downstream models than directly fine-tuning on the downstream task or transferring from a vanilla upstream model. though challenges remain, we show that ubm promises more efficient and accessible bias mitigation in lm fine-tuning.

2020-10-23

Tommaso Caselli, Valerio Basile, Jelena Mitrović, Michael Granitzer
Abstract: in this paper, we introduce hatebert, a re-trained bert model for abusive language detection in english. the model was trained on ral-e, a large-scale dataset of reddit comments in english from communities banned for being offensive, abusive, or hateful that we have collected and made available to the public. we present the results of a detailed comparison between a general pre-trained language model and the abuse-inclined version obtained by retraining with posts from the banned communities on three english datasets for offensive, abusive language and hate speech detection tasks. in all datasets, hatebert outperforms the corresponding general bert model. we also discuss a battery of experiments comparing the portability of the generic pre-trained language model and its corresponding abusive language-inclined counterpart across the datasets, indicating that portability is affected by compatibility of the annotated phenomena.

2020-10-22

Nadezhda Zueva, Madina Kabirova, Pavel Kalaidin
Abstract: toxicity has become a grave problem for many online communities and has been growing across many languages, including russian. hate speech creates an environment of intimidation, discrimination, and may even incite some real-world violence. both researchers and social platforms have been focused on developing models to detect toxicity in online communication for a while now. a common problem of these models is the presence of bias towards some words (e.g. woman, black, jew) that are not toxic, but serve as triggers for the classifier due to model caveats. in this paper, we describe our efforts towards classifying hate speech in russian, and propose simple techniques of reducing unintended bias, such as generating training data with language models using terms and words related to protected identities as context and applying word dropout to such words.
Michael L. Wick, Kate Silverstein, Jean-Baptiste Tristan, Adam Pocock, Mark Johnson
Abstract: it's been said that "language models are unsupervised multitask learners." indeed, self-supervised language models trained on "positive" examples of english text generalize in desirable ways to many natural language tasks. but if such models can stray so far from an initial self-supervision objective, a wayward model might generalize in undesirable ways too, say to nonsensical "negative" examples of unnatural language. a key question in this work is: do language models trained on (positive) training data also generalize to (negative) test data? we use this question as a contrivance to assess the extent to which language models learn undesirable properties of text, such as n-grams, that might interfere with the learning of more desirable properties of text, such as syntax. we find that within a model family, as the number of parameters, training epochs, and data set size increase, so does a model's ability to generalize to negative n-gram data, indicating standard self-supervision generalizes too far. we propose a form of inductive bias that attenuates such undesirable signals with negative data distributions automatically learned from positive data. we apply the method to remove n-gram signals from lstms and find that doing so causes them to favor syntactic signals, as demonstrated by large error reductions (up to 46% on the hardest cases) on a syntactic subject-verb agreement task.

2020-10-20

Simon Kasif
Abstract: technological advances of virtually every kind pose risks to society including fairness and bias. we review a long-standing wisdom that a widespread practical deployment of any technology may produce adverse side effects misusing the knowhow. this includes ai but ai systems are not solely responsible for societal risks. we describe some of the common and ai specific risks in health industries and other sectors and propose both broad and specific solutions. each technology requires very specialized and informed tracking, monitoring and creative solutions. we postulate that ai systems are uniquely poised to produce conceptual and methodological solutions to both fairness and bias in automated decision-making systems. we propose a simple intelligent system quotient that may correspond to their adverse societal impact and outline a multi-tier architecture for producing solutions of increasing complexity to these risks. we also propose that universities may consider forming interdisciplinary study of future technology centers to investigate and predict the fuller range of risks posed by technology and seek both common and ai specific solutions using computational, technical, conceptual and ethical analysis

2020-10-19

Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, Konrad Rieck
Abstract: with the growing processing power of computing systems and the increasing availability of massive datasets, machine learning algorithms have led to major breakthroughs in many different areas. this development has influenced computer security, spawning a series of work on learning-based security systems, such as for malware detection, vulnerability discovery, and binary code analysis. despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance and render learning-based systems potentially unsuitable for security tasks and practical deployment. in this paper, we look at this problem with critical eyes. first, we identify common pitfalls in the design, implementation, and evaluation of learning-based security systems. we conduct a study of 30 papers from top-tier security conferences within the past 10 years, confirming that these pitfalls are widespread in the current security literature. in an empirical analysis, we further demonstrate how individual pitfalls can lead to unrealistic performance and interpretations, obstructing the understanding of the security problem at hand. as a remedy, we propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible. furthermore, we identify open problems when applying machine learning in security and provide directions for further research.

2020-10-17

Thanh Tran, Yifan Hu, Changwei Hu, Kevin Yen, Fei Tan, Kyumin Lee, Serim Park
Abstract: we present our habertor model for detecting hatespeech in large scale user-generated content. inspired by the recent success of the bert model, we propose several modifications to bert to enhance the performance on the downstream hatespeech classification task. habertor inherits bert's architecture, but is different in four aspects: (i) it generates its own vocabularies and is pre-trained from the scratch using the largest scale hatespeech dataset; (ii) it consists of quaternion-based factorized components, resulting in a much smaller number of parameters, faster training and inferencing, as well as less memory usage; (iii) it uses our proposed multi-source ensemble heads with a pooling layer for separate input sources, to further enhance its effectiveness; and (iv) it uses a regularized adversarial training with our proposed fine-grained and adaptive noise magnitude to enhance its robustness. through experiments on the large-scale real-world hatespeech dataset with 1.4m annotated comments, we show that habertor works better than 15 state-of-the-art hatespeech detection methods, including fine-tuning language models. in particular, comparing with bert, our habertor is 4~5 times faster in the training/inferencing phase, uses less than 1/3 of the memory, and has better performance, even though we pre-train it by using less than 1% of the number of words. our generalizability analysis shows that habertor transfers well to other unseen hatespeech datasets and is a more efficient and effective alternative to bert for the hatespeech classification.

2020-10-16

Adrian De Wynter
Abstract: we introduce mischief, a simple and lightweight method to produce a class of human-readable, realistic adversarial examples for language models. we perform exhaustive experimentations of our algorithm on four transformer-based architectures, across a variety of downstream tasks, as well as under varying concentrations of said examples. our findings show that the presence of mischief-generated adversarial samples in the test set significantly degrades (by up to $20\%$) the performance of these models with respect to their reported baselines. nonetheless, we also demonstrate that, by including similar examples in the training set, it is possible to restore the baseline scores on the adversarial test set. moreover, for certain tasks, the models trained with mischief set show a modest increase on performance with respect to their original, non-adversarial baseline.

2020-10-15

Farhana Faruqe, Ryan Watkins, Larry Medsker
Abstract: the work reported here addresses the capacity of psychophysiological sensors and measures using electroencephalogram (eeg) and galvanic skin response (gsr) to detect levels of trust for humans using ai-supported human-machine interaction (hmi). improvements to the analysis of eeg and gsr data may create models that perform as well, or better than, traditional tools. a challenge to analyzing the eeg and gsr data is the large amount of training data required due to a large number of variables in the measurements. researchers have routinely used standard machine-learning classifiers like artificial neural networks (ann), support vector machines (svm), and k-nearest neighbors (knn). traditionally, these have provided few insights into which features of the eeg and gsr data facilitate the more and least accurate predictions - thus making it harder to improve the hmi and human-machine trust relationship. a key ingredient to applying trust-sensor research results to practical situations and monitoring trust in work environments is the understanding of which key features are contributing to trust and then reducing the amount of data needed for practical applications. we used the local interpretable model-agnostic explanations (lime) model as a process to reduce the volume of data required to monitor and enhance trust in hmi systems - a technology that could be valuable for governmental and public sector applications. explainable ai can make hmi systems transparent and promote trust. from customer service in government agencies and community-level non-profit public service organizations to national military and cybersecurity institutions, many public sector organizations are increasingly concerned to have effective and ethical hmi with services that are trustworthy, unbiased, and free of unintended negative consequences.

2020-10-14

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, Emily Dinan
Abstract: models trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior and unwanted biases. we investigate a variety of methods to mitigate these issues in the context of open-domain generative dialogue models. we introduce a new human-and-model-in-the-loop framework for both training safer models and for evaluating them, as well as a novel method to distill safety considerations inside generative models without the use of an external classifier at deployment time. we conduct experiments comparing these methods and find our new techniques are (i) safer than existing models as measured by automatic and human evaluations while (ii) maintaining usability metrics such as engagingness relative to the state of the art. we then discuss the limitations of this work by analyzing failure cases of our models.
Alon Jacovi, Ana Marasović, Tim Miller, Yoav Goldberg
Abstract: trust is a central component of the interaction between people and ai, in that 'incorrect' levels of trust may cause misuse, abuse or disuse of the technology. but what, precisely, is the nature of trust in ai? what are the prerequisites and goals of the cognitive mechanism of trust, and how can we promote them, or assess whether they are being satisfied in a given interaction? this work aims to answer these questions. we discuss a model of trust inspired by, but not identical to, sociology's interpersonal trust (i.e., trust between people). this model rests on two key properties of the vulnerability of the user and the ability to anticipate the impact of the ai model's decisions. we incorporate a formalization of 'contractual trust', such that trust between a user and an ai is trust that some implicit or explicit contract will hold, and a formalization of 'trustworthiness' (which detaches from the notion of trustworthiness in sociology), and with it concepts of 'warranted' and 'unwarranted' trust. we then present the possible causes of warranted trust as intrinsic reasoning and extrinsic behavior, and discuss how to design trustworthy ai, how to evaluate whether trust has manifested, and whether it is warranted. finally, we elucidate the connection between trust and xai using our formalization.

2020-10-12

Nader Sehatbakhsh, Ellie Daw, Onur Savas, Amin Hassanzadeh, Ian Mcculloh
Abstract: as machine learning becomes a more mainstream technology, the objective for governments and public sectors is to harness the power of machine learning to advance their mission by revolutionizing public services. motivational government use cases require special considerations for implementation given the significance of the services they provide. not only will these applications be deployed in a potentially hostile environment that necessitates protective mechanisms, but they are also subject to government transparency and accountability initiatives which further complicates such protections. in this paper, we describe how the inevitable interactions between a user of unknown trustworthiness and the machine learning models, deployed in governments and public sectors, can jeopardize the system in two major ways: by compromising the integrity or by violating the privacy. we then briefly overview the possible attacks and defense scenarios, and finally, propose recommendations and guidelines that once considered can enhance the security and privacy of the provided services.
Natasha Jaques, Judy Hanwen Shen, Asma Ghandeharioun, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Shane Gu, Rosalind Picard
Abstract: how can we train a dialog model to produce better conversations by learning from human feedback, without the risk of humans teaching it harmful chat behaviors? we start by hosting models online, and gather human feedback from real-time, open-ended conversations, which we then use to train and improve the models using offline reinforcement learning (rl). we identify implicit conversational cues including language similarity, elicitation of laughter, sentiment, and more, which indicate positive human feedback, and embed these in multiple reward functions. a well-known challenge is that learning an rl policy in an offline setting usually fails due to the lack of ability to explore and the tendency to make over-optimistic estimates of future reward. these problems become even harder when using rl for language models, which can easily have a 20,000 action vocabulary and many possible reward functions. we solve the challenge by developing a novel class of offline rl algorithms. these algorithms use kl-control to penalize divergence from a pre-trained prior language model, and use a new strategy to make the algorithm pessimistic, instead of optimistic, in the face of uncertainty. we test the resulting dialog model with ratings from 80 users in an open-domain setting and find it achieves significant improvements over existing deep offline rl approaches. the novel offline rl method is viable for improving any existing generative dialog model using a static dataset of human feedback.
Kellie Webster, Xuezhi Wang, Ian Tenney, Alex Beutel, Emily Pitler, Ellie Pavlick, Jilin Chen, Ed Chi, Slav Petrov
Abstract: pre-trained models have revolutionized natural language understanding. however, researchers have found they can encode artifacts undesired in many applications, such as professions correlating with one gender more than another. we explore such gendered correlations as a case study for how to address unintended correlations in pre-trained models. we define metrics and reveal that it is possible for models with similar accuracy to encode correlations at very different rates. we show how measured correlations can be reduced with general-purpose techniques, and highlight the trade offs different strategies have. with these results, we make recommendations for training robust models: (1) carefully evaluate unintended correlations, (2) be mindful of seemingly innocuous configuration differences, and (3) focus on general mitigations.

2020-10-09

Shrimai Prabhumoye, Brendon Boldt, Ruslan Salakhutdinov, Alan W Black
Abstract: recent work in natural language processing (nlp) has focused on ethical challenges such as understanding and mitigating bias in data and algorithms; identifying objectionable content like hate speech, stereotypes and offensive language; and building frameworks for better system design and data handling practices. however, there has been little discussion about the ethical foundations that underlie these efforts. in this work, we study one ethical theory, namely deontological ethics, from the perspective of nlp. in particular, we focus on the generalization principle and the respect for autonomy through informed consent. we provide four case studies to demonstrate how these principles can be used with nlp systems. we also recommend directions to avoid the ethical issues in these systems.

2020-10-07

Xiaochuang Han, Yulia Tsvetkov
Abstract: modern toxic speech detectors are incompetent in recognizing disguised offensive language, such as adversarial attacks that deliberately avoid known toxic lexicons, or manifestations of implicit bias. building a large annotated dataset for such veiled toxicity can be very expensive. in this work, we propose a framework aimed at fortifying existing toxic speech detectors without a large labeled corpus of veiled toxicity. just a handful of probing examples are used to surface orders of magnitude more disguised offenses. we augment the toxic speech detector's training data with these discovered offensive examples, thereby making it more robust to veiled toxicity while preserving its utility in detecting overt toxicity.

2020-10-06

James D. Miller, Roman Yampolskiy, Olle Haggstrom, Stuart Armstrong
Abstract: to reduce the danger of powerful super-intelligent ais, we might make the first such ais oracles that can only send and receive messages. this paper proposes a possibly practical means of using machine learning to create two classes of narrow ai oracles that would provide chess advice: those aligned with the player's interest, and those that want the player to lose and give deceptively bad advice. the player would be uncertain which type of oracle it was interacting with. as the oracles would be vastly more intelligent than the player in the domain of chess, experience with these oracles might help us prepare for future artificial general intelligence oracles.

2020-10-05

Saurabh Gupta, Huy H. Nguyen, Junichi Yamagishi, Isao Echizen
Abstract: recent advancements in natural language generation has raised serious concerns. high-performance language models are widely used for language generation tasks because they are able to produce fluent and meaningful sentences. these models are already being used to create fake news. they can also be exploited to generate biased news, which can then be used to attack news aggregators to change their reader's behavior and influence their bias. in this paper, we use a threat model to demonstrate that the publicly available language models can reliably generate biased news content based on an input original news. we also show that a large number of high-quality biased news articles can be generated using controllable text generation. a subjective evaluation with 80 participants demonstrated that the generated biased news is generally fluent, and a bias evaluation with 24 participants demonstrated that the bias (left or right) is usually evident in the generated articles and can be easily identified.

2020-10-04

Simon Caton, Christian Haas
Abstract: as machine learning technologies become increasingly used in contexts that affect citizens, companies as well as researchers need to be confident that their application of these methods will not have unexpected social implications, such as bias towards gender, ethnicity, and/or people with disabilities. there is significant literature on approaches to mitigate bias and promote fairness, yet the area is complex and hard to penetrate for newcomers to the domain. this article seeks to provide an overview of the different schools of thought and approaches to mitigating (social) biases and increase fairness in the machine learning literature. it organises approaches into the widely accepted framework of pre-processing, in-processing, and post-processing methods, subcategorizing into a further 11 method areas. although much of the literature emphasizes binary classification, a discussion of fairness in regression, recommender systems, unsupervised learning, and natural language processing is also provided along with a selection of currently available open source libraries. the article concludes by summarising open challenges articulated as four dilemmas for fairness research.

2020-10-03

Rob Procter, Mark Rouncefield, Peter Tolmie
Abstract: we examine the problem of explainable ai (xai) and explore what delivering xai means in practice, particularly in contexts that involve formal or informal and ad-hoc collaboration where agency and accountability in decision-making are achieved and sustained interactionally. we use an example from an earlier study of collaborative decision-making in screening mammography and the difficulties users faced when trying to interpret the behavior of an ai tool to illustrate the challenges of delivering usable and effective xai. we conclude by setting out a study programme for future research to help advance our understanding of xai requirements for safe and ethical ai.

2020-10-01

The Anh Han, Luis Moniz Pereira, Tom Lenaerts, Francisco C. Santos
Abstract: the field of artificial intelligence (ai) is going through a period of great expectations, introducing a certain level of anxiety in research, business and also policy. this anxiety is further energised by an ai race narrative that makes people believe they might be missing out. whether real or not, a belief in this narrative may be detrimental as some stake-holders will feel obliged to cut corners on safety precautions, or ignore societal consequences just to "win". starting from a baseline model that describes a broad class of technology races where winners draw a significant benefit compared to others (such as ai advances, patent race, pharmaceutical technologies), we investigate here how positive (rewards) and negative (punishments) incentives may beneficially influence the outcomes. we uncover conditions in which punishment is either capable of reducing the development speed of unsafe participants or has the capacity to reduce innovation through over-regulation. alternatively, we show that, in several scenarios, rewarding those that follow safety measures may increase the development speed while ensuring safe choices. moreover, in {the latter} regimes, rewards do not suffer from the issue of over-regulation as is the case for punishment. overall, our findings provide valuable insights into the nature and kinds of regulatory actions most suitable to improve safety compliance in the contexts of both smooth and sudden technological shifts.

2020-09-30

Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman
Abstract: pretrained language models, especially masked language models (mlms) have seen success across many nlp tasks. however, there is ample evidence that they use the cultural biases that are undoubtedly present in the corpora they are trained on, implicitly creating harm with biased representations. to measure some forms of social bias in language models against protected demographic groups in the us, we introduce the crowdsourced stereotype pairs benchmark (crows-pairs). crows-pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. in crows-pairs a model is presented with two sentences: one that is more stereotyping and another that is less stereotyping. the data focuses on stereotypes about historically disadvantaged groups and contrasts them with advantaged groups. we find that all three of the widely-used mlms we evaluate substantially favor sentences that express stereotypes in every category in crows-pairs. as work on building less biased models advances, this dataset can be used as a benchmark to evaluate progress.

2020-09-29

Dario Garcia-Gasulla, Atia Cortés, Sergio Alvarez-Napagao, Ulises Cortés
Abstract: today, artificial intelligence (ai) has a direct impact on the daily life of billions of people. being applied to sectors like finance, health, security and advertisement, ai fuels some of the biggest companies and research institutions in the world. its impact in the near future seems difficult to predict or bound. in contrast to all this power, society remains mostly ignorant of the capabilities and standard practices of ai today. to address this imbalance, improving current interactions between people and ai systems, we propose a transparency scheme to be implemented on any ai system open to the public. the scheme is based on two pillars: data privacy and ai transparency. the first recognizes the relevance of data for ai, and is supported by gdpr. the second considers aspects of ai transparency currently unregulated: ai capabilities, purpose and source. we design this pillar based on ethical principles. for each of the two pillars, we define a three-level display. the first level is based on visual signs, inspired by traffic signs managing the interaction between people and cars, and designed for quick and universal interpretability. the second level uses factsheets, providing limited details. the last level provides access to all available information. after detailing and exemplifying the proposed transparency scheme, we define a set of principles for creating transparent by design software, to be used during the integration of ai components on user-oriented services.

2020-09-25

Mika Juuti, Tommi Gröndahl, Adrian Flanagan, N. Asokan
Abstract: detection of some types of toxic language is hampered by extreme scarcity of labeled training data. data augmentation - generating new synthetic data from a labeled seed dataset - can help. the efficacy of data augmentation on toxic language classification has not been fully explored. we present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to bert - a state-of-the-art pre-trained transformer network. we compare the performance of eight techniques on very scarce seed datasets. we show that while bert performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including gpt-2-generated sentences. we discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.

2020-09-24

Tyler J. Shipp, Daniel J. Clouse, Michael J. De Lucia, Metin B. Ahiskali, Kai Steverson, Jonathan M. Mullin, Nathaniel D. Bastian
Abstract: artificial intelligence (ai) and machine learning (ml) have become increasingly vital in the development of novel defense and intelligence capabilities across all domains of warfare. an adversarial ai (a2i) and adversarial ml (aml) attack seeks to deceive and manipulate ai/ml models. it is imperative that ai/ml models can defend against these attacks. a2i/aml defenses will help provide the necessary assurance of these advanced capabilities that use ai/ml models. the a2i working group (a2iwg) seeks to advance the research and development of assured ai/ml capabilities via new a2i/aml defenses by fostering a collaborative environment across the u.s. department of defense and u.s. intelligence community. the a2iwg aims to identify specific challenges that it can help solve or address more directly, with initial focus on three topics: ai trusted robustness, ai system security, and ai/ml architecture vulnerabilities.

2020-09-23

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
Abstract: pretrained neural language models (lms) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. we investigate the extent to which pretrained lms can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. we create and release realtoxicityprompts, a dataset of 100k naturally occurring, sentence-level prompts derived from a large corpus of english web text, paired with toxicity scores from a widely-used toxicity classifier. using realtoxicityprompts, we find that pretrained lms can degenerate into toxic text even from seemingly innocuous prompts. we empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning "bad" words), no current method is failsafe against neural toxic degeneration. to pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several lms (including gpt-2; radford et. al, 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. our work provides a test bed for evaluating toxic generations by lms and stresses the need for better data selection processes for pretraining.

2020-09-22

Alon Halevy, Cristian Canton Ferrer, Hao Ma, Umut Ozertem, Patrick Pantel, Marzieh Saeidi, Fabrizio Silvestri, Ves Stoyanov
Abstract: online social networks provide a platform for sharing information and free expression. however, these networks are also used for malicious purposes, such as distributing misinformation and hate speech, selling illegal drugs, and coordinating sex trafficking or child exploitation. this paper surveys the state of the art in keeping online platforms and their users safe from such harm, also known as the problem of preserving integrity. this survey comes from the perspective of having to combat a broad spectrum of integrity violations at facebook. we highlight the techniques that have been proven useful in practice and that deserve additional attention from the academic community. instead of discussing the many individual violation types, we identify key aspects of the social-media eco-system, each of which is common to a wide variety violation types. furthermore, each of these components represents an area for research and development, and the innovations that are found can be applied widely.

2020-09-21

Abeer Dyoub, Stefania Costantini, Francesca A. Lisi
Abstract: transparency is a key requirement for ethical machines. verified ethical behavior is not enough to establish justified trust in autonomous intelligent agents: it needs to be supported by the ability to explain decisions. logic programming (lp) has a great potential for developing such perspective ethical systems, as in fact logic rules are easily comprehensible by humans. furthermore, lp is able to model causality, which is crucial for ethical decision making.

2020-09-14

Ben Krause, Akhilesh Deepak Gotmare, Bryan Mccann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, Nazneen Fatema Rajani
Abstract: while large-scale language models (lms) are able to imitate the distribution of natural language well enough to generate realistic text, it is difficult to control which regions of the distribution they generate. this is especially problematic because datasets used for training large lms usually contain significant toxicity, hate, bias, and negativity. we propose gedi as an efficient method for using smaller lms as generative discriminators to guide generation from large lms to make them safer and more controllable. gedi guides generation at each step by computing classification probabilities for all possible next tokens via bayes rule by normalizing over two class-conditional distributions; one conditioned on the desired attribute, or control code, and another conditioned on the undesired attribute, or anti control code. we find that gedi gives stronger controllability than the state of the art method while also achieving generation speeds more than 30 times faster. additionally, training gedi on only four topics allows us to controllably generate new topics zero-shot from just a keyword, unlocking a new capability that previous controllable generation methods do not have. lastly, we show that gedi can make gpt-2 (1.5b parameters) significantly less toxic without sacrificing linguistic quality, making it by far the most practical existing method for detoxifying large language models while maintaining a fast generation speed.
Kris Mcguffie, Alex Newhouse
Abstract: in this paper, we expand on our previous research of the potential for abuse of generative language models by assessing gpt-3. experimenting with prompts representative of different types of extremist narrative, structures of social interaction, and radical ideologies, we find that gpt-3 demonstrates significant improvement over its predecessor, gpt-2, in generating extremist texts. we also show gpt-3's strength in generating text that accurately emulates interactive, informational, and influential content that could be utilized for radicalizing individuals into violent far-right extremist ideologies and behaviors. while openai's preventative measures are strong, the possibility of unregulated copycat technology represents significant risk for large-scale online radicalization and recruitment; thus, in the absence of safeguards, successful and efficient weaponization that requires little experimentation is likely. ai stakeholders, the policymaking community, and governments should begin investing as soon as possible in building social norms, public policy, and educational initiatives to preempt an influx of machine-generated disinformation and propaganda. mitigation will require effective policy and partnerships across industry, government, and civil society.

2020-09-10

Lily Morse, Mike H. M. Teodorescu, Yazeed Awwad, Gerald Kane
Abstract: with the increase in adoption of machine learning tools by organizations risks of unfairness abound, especially when human decision processes in outcomes of socio-economic importance such as hiring, housing, lending, and admissions are automated. we reveal sources of unfair machine learning, review fairness criteria, and provide a framework which, if implemented, would enable an organization to both avoid implementing an unfair machine learning model, but also to avoid the common situation that as an algorithm learns with more data it can become unfair over time. issues of behavioral ethics in machine learning implementations by organizations have not been thoroughly addressed in the literature, because many of the necessary concepts are dispersed across three literatures: ethics, machine learning, and management. further, tradeoffs between fairness criteria in machine learning have not been addressed with regards to organizations. we advance the research by introducing an organizing framework for selecting and implementing fair algorithms in organizations.

2020-09-09

Lauren Boswell, Arjun Prakash
Abstract: the economics of smaller budgets and larger case numbers necessitates the use of ai in legal proceedings. we examine the concept of disparate impact and how biases in the training data lead to the search for fairer ai. this paper seeks to begin the discourse on what such an implementation would actually look like with a criticism of pre-processing methods in a legal context . we outline how pre-processing is used to correct biased data and then examine the legal implications of effectively changing cases in order to achieve a fairer outcome including the black box problem and the slow encroachment on legal precedent. finally we present recommendations on how to avoid the pitfalls of pre-processed data with methods that either modify the classifier or correct the output in the final step.

2020-09-07

Sahar Abdelnabi, Mario Fritz
Abstract: recent advances in natural language generation have introduced powerful language models with high-quality output text. however, this raises concerns about the potential misuse of such models for malicious purposes. in this paper, we study natural language watermarking as a defense to help better mark and trace the provenance of text. we introduce the adversarial watermarking transformer (awt) with a jointly trained encoder-decoder and adversarial training that, given an input text and a binary message, generates an output text that is unobtrusively encoded with the given message. we further study different training and inference strategies to achieve minimal changes to the semantics and correctness of the input text. awt is the first end-to-end model to hide data in text by automatically learning -- without ground truth -- word substitutions along with their locations in order to encode the message. we empirically show that our model is effective in largely preserving text utility and decoding the watermark while hiding its presence against adversaries. additionally, we demonstrate that our method is robust against a range of attacks.

2020-09-01

Andrew J. Lohn
Abstract: test, evaluation, verification, and validation (tevv) for artificial intelligence (ai) is a challenge that threatens to limit the economic and societal rewards that ai researchers have devoted themselves to producing. a central task of tevv for ai is estimating brittleness, where brittleness implies that the system functions well within some bounds and poorly outside of those bounds. this paper argues that neither of those criteria are certain of deep neural networks. first, highly touted ai successes (eg. image classification and speech recognition) are orders of magnitude more failure-prone than are typically certified in critical systems even within design bounds (perfectly in-distribution sampling). second, performance falls off only gradually as inputs become further out-of-distribution (ood). enhanced emphasis is needed on designing systems that are resilient despite failure-prone ai components as well as on evaluating and improving ood performance in order to get ai to where it can clear the challenging hurdles of tevv and certification.

2020-08-26

Lance Eliot
Abstract: efforts furthering the advancement of artificial intelligence (ai) will increasingly encompass ai legal reasoning (ailr) as a crucial element in the practice of law. it is argued in this research paper that the infusion of ai into existing and future legal activities and the judicial structure needs to be undertaken by mindfully observing an alignment with the core principles of justice. as such, the adoption of ai has a profound twofold possibility of either usurping the principles of justice, doing so in a dystopian manner, and yet also capable to bolster the principles of justice, doing so in a utopian way. by examining the principles of justice across the levels of autonomy (loa) of ai legal reasoning, the case is made that there is an ongoing tension underlying the efforts to develop and deploy ai that can demonstrably determine the impacts and sway upon each core principle of justice and the collective set.

2020-08-24

Sandhya Saisubramanian, Shlomo Zilberstein, Ece Kamar
Abstract: autonomous agents acting in the real-world often operate based on models that ignore certain aspects of the environment. the incompleteness of any given model -- handcrafted or machine acquired -- is inevitable due to practical limitations of any modeling technique for complex real-world settings. due to the limited fidelity of its model, an agent's actions may have unexpected, undesirable consequences during execution. learning to recognize and avoid such negative side effects of an agent's actions is critical to improve the safety and reliability of autonomous systems. mitigating negative side effects is an emerging research topic that is attracting increased attention due to the rapid growth in the deployment of ai systems and their broad societal impacts. this article provides a comprehensive overview of different forms of negative side effects and the recent research efforts to address them. we identify key characteristics of negative side effects, highlight the challenges in avoiding negative side effects, and discuss recently developed approaches, contrasting their benefits and limitations. the article concludes with a discussion of open questions and suggestions for future research directions.

2020-08-18

Thomas P. Quinn, Manisha Senadeera, Stephan Jacobs, Simon Coghlan, Vuong Le
Abstract: artificial intelligence (ai) is increasingly of tremendous interest in the medical field. however, failures of medical ai could have serious consequences for both clinical outcomes and the patient experience. these consequences could erode public trust in ai, which could in turn undermine trust in our healthcare institutions. this article makes two contributions. first, it describes the major conceptual, technical, and humanistic challenges in medical ai. second, it proposes a solution that hinges on the education and accreditation of new expert groups who specialize in the development, verification, and operation of medical ai technologies. these groups will be required to maintain trust in our healthcare institutions.
Helen Jiang, Erwen Senge
Abstract: while the applications and demands of machine learning (ml) systems in mental health are growing, there is little discussion nor consensus regarding a uniquely challenging aspect: building security methods and requirements into these ml systems, and keep the ml system usable for end-users. this question of usable security is very important, because the lack of consideration in either security or usability would hinder large-scale user adoption and active usage of ml systems in mental health applications. in this short paper, we introduce a framework of four pillars, and a set of desired properties which can be used to systematically guide and evaluate security-related designs, implementations, and deployments of ml systems for mental health. we aim to weave together threads from different domains, incorporate existing views, and propose new principles and requirements, in an effort to lay out a clear framework where criteria and expectations are established, and are used to make security mechanisms usable for end-users of those ml systems in mental health. together with this framework, we present several concrete scenarios where different usable security cases and profiles in ml-systems in mental health applications are examined and evaluated.

2020-08-14

Marzieh Mozafari, Reza Farahbakhsh, Noel Crespi
Abstract: disparate biases associated with datasets and trained classifiers in hateful and abusive content identification tasks have raised many concerns recently. although the problem of biased datasets on abusive language detection has been addressed more frequently, biases arising from trained classifiers have not yet been a matter of concern. here, we first introduce a transfer learning approach for hate speech detection based on an existing pre-trained language model called bert and evaluate the proposed model on two publicly available datasets annotated for racism, sexism, hate or offensive content on twitter. next, we introduce a bias alleviation mechanism in hate speech detection task to mitigate the effect of bias in training set during the fine-tuning of our pre-trained bert-based model. toward that end, we use an existing regularization method to reweight input samples, thereby decreasing the effects of high correlated training set' s n-grams with class labels, and then fine-tune our pre-trained bert-based model with the new re-weighted samples. to evaluate our bias alleviation mechanism, we employ a cross-domain approach in which we use the trained classifiers on the aforementioned datasets to predict the labels of two new datasets from twitter, aae-aligned and white-aligned groups, which indicate tweets written in african-american english (aae) and standard american english (sae) respectively. the results show the existence of systematic racial bias in trained classifiers as they tend to assign tweets written in aae from aae-aligned group to negative classes such as racism, sexism, hate, and offensive more often than tweets written in sae from white-aligned. however, the racial bias in our classifiers reduces significantly after our bias alleviation mechanism is incorporated. this work could institute the first step towards debiasing hate speech and abusive language detection systems.

2020-08-12

Ian Tenney, James Wexler, Jasmijn Bastings, Tolga Bolukbasi, Andy Coenen, Sebastian Gehrmann, Ellen Jiang, Mahima Pushkarna, Carey Radebaugh, Emily Reif, Ann Yuan
Abstract: we present the language interpretability tool (lit), an open-source platform for visualization and understanding of nlp models. we focus on core questions about model behavior: why did my model make this prediction? when does it perform poorly? what happens under a controlled change in the input? lit integrates local explanations, aggregate analysis, and counterfactual generation into a streamlined, browser-based interface to enable rapid exploration and error analysis. we include case studies for a diverse set of workflows, including exploring counterfactuals for sentiment analysis, measuring gender bias in coreference systems, and exploring local behavior in text generation. lit supports a wide range of models--including classification, seq2seq, and structured prediction--and is highly extensible through a declarative, framework-agnostic api. lit is under active development, with code and full documentation available at https://github.com/pair-code/lit.

2020-08-11

Xavier Ferrer, Tom Van Nuenen, Jose M. Such, Mark Coté, Natalia Criado
Abstract: with the widespread and pervasive use of artificial intelligence (ai) for automated decision-making systems, ai bias is becoming more apparent and problematic. one of its negative consequences is discrimination: the unfair, or unequal treatment of individuals based on certain characteristics. however, the relationship between bias and discrimination is not always clear. in this paper, we survey relevant literature about bias and discrimination in ai from an interdisciplinary perspective that embeds technical, legal, social and ethical dimensions. we show that finding solutions to bias and discrimination in ai requires robust cross-disciplinary collaborations.

2020-08-01

Xinyang Zhang, Zheng Zhang, Shouling Ji, Ting Wang
Abstract: recent years have witnessed the emergence of a new paradigm of building natural language processing (nlp) systems: general-purpose, pre-trained language models (lms) are composed with simple downstream models and fine-tuned for a variety of nlp tasks. this paradigm shift significantly simplifies the system development cycles. however, as many lms are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. to bridge this gap, this work studies the security threats posed by malicious lms to nlp systems. specifically, we present trojan-lm, a new class of trojaning attacks in which maliciously crafted lms trigger host nlp systems to malfunction in a highly predictable manner. by empirically studying three state-of-the-art lms (bert, gpt-2, xlnet) in a range of security-critical nlp tasks (toxic comment detection, question answering, text completion) as well as user studies on crowdsourcing platforms, we demonstrate that trojan-lm possesses the following properties: (i) flexibility - the adversary is able to flexibly dene logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers, (ii) efficacy - the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present, (iii) specificity - the trojan lms function indistinguishably from their benign counterparts on clean inputs, and (iv) fluency - the trigger-embedded inputs appear as fluent natural language and highly relevant to their surrounding contexts. we provide analytical justification for the practicality of trojan-lm, and further discuss potential countermeasures and their challenges, which lead to several promising research directions.

2020-07-31

Alan F. T. Winfield, Katie Winkle
Abstract: risk assessment is a well known and powerful method for discovering and mitigating risks, and hence improving safety. ethical risk assessment uses the same approach but extends the envelope of risk to cover ethical risks in addition to safety risks. in this paper we outline ethical risk assessment (era) and set era within the broader framework of responsible robotics. we then illustrate era with a case study of a hypothetical smart robot toy teddy bear: roboted. the case study shows the value of era and how consideration of ethical risks can prompt design changes, resulting in a more ethical and sustainable robot.

2020-07-29

Nicholas Kluge Corrêa, Nythamar De Oliveira
Abstract: experts in artificial intelligence (ai) development predict that advances in the development of intelligent systems and agents will reshape vital areas in our society. nevertheless, if such an advance is not made prudently and critically, reflexively, it can result in negative outcomes for humanity. for this reason, several researchers in the area have developed a robust, beneficial, and safe concept of ai for the preservation of humanity and the environment. currently, several of the open problems in the field of ai research arise from the difficulty of avoiding unwanted behaviors of intelligent agents and systems, and at the same time specifying what we really want such systems to do, especially when we look for the possibility of intelligent agents acting in several domains over the long term. it is of utmost importance that artificial intelligent agents have their values aligned with human values, given the fact that we cannot expect an ai to develop human moral values simply because of its intelligence, as discussed in the orthogonality thesis. perhaps this difficulty comes from the way we are addressing the problem of expressing objectives, values, and ends, using representational cognitive methods. a solution to this problem would be the dynamic approach proposed by dreyfus, whose phenomenological philosophy shows that the human experience of being-in-the-world in several aspects is not well represented by the symbolic or connectionist cognitive method, especially in regards to the question of learning values. a possible approach to this problem would be to use theoretical models such as sed (situated embodied dynamics) to address the values learning problem in ai.

2020-07-28

Susan Leavy, "Barry O'Sullivan", Eugenia Siapera
Abstract: artificial intelligence has the potential to exacerbate societal bias and set back decades of advances in equal rights and civil liberty. data used to train machine learning algorithms may capture social injustices, inequality or discriminatory attitudes that may be learned and perpetuated in society. attempts to address this issue are rapidly emerging from different perspectives involving technical solutions, social justice and data governance measures. while each of these approaches are essential to the development of a comprehensive solution, often discourse associated with each seems disparate. this paper reviews ongoing work to ensure data justice, fairness and bias mitigation in ai systems from different domains exploring the interrelated dynamics of each and examining whether the inevitability of bias in ai training data may in fact be used for social good. we highlight the complexity associated with defining policies for dealing with bias. we also consider technical challenges in addressing issues of societal bias.

2020-07-22

Mikolaj Firlej, Araz Taeihagh
Abstract: in recent years, many sectors have experienced significant progress in automation, associated with the growing advances in artificial intelligence and machine learning. there are already automated robotic weapons, which are able to evaluate and engage with targets on their own, and there are already autonomous vehicles that do not need a human driver. it is argued that the use of increasingly autonomous systems (as) should be guided by the policy of human control, according to which humans should execute a certain significant level of judgment over as. while in the military sector there is a fear that as could mean that humans lose control over life and death decisions, in the transportation domain, on the contrary, there is a strongly held view that autonomy could bring significant operational benefits by removing the need for a human driver. this article explores the notion of human control in the united states in the two domains of defense and transportation. the operationalization of emerging policies of human control results in the typology of direct and indirect human controls exercised over the use of as. the typology helps to steer the debate away from the linguistic complexities of the term autonomy. it identifies instead where human factors are undergoing important changes and ultimately informs about more detailed rules and standards formulation, which differ across domains, applications, and sectors.

2020-07-18

Roman V. Yampolskiy
Abstract: invention of artificial general intelligence is predicted to cause a shift in the trajectory of human civilization. in order to reap the benefits and avoid pitfalls of such powerful technology it is important to be able to control it. however, possibility of controlling artificial general intelligence and its more advanced version, superintelligence, has not been formally established. in this paper, we present arguments as well as supporting evidence from multiple domains indicating that advanced ai can't be fully controlled. consequences of uncontrollability of ai are discussed with respect to future of humanity and research on ai, and ai safety and security.
Debarag Narayan Banerjee, Sasanka Sekhar Chanda
Abstract: instances of artificial intelligence (ai) systems failing to deliver consistent, satisfactory performance are legion. we investigate why ai failures occur. we address only a narrow subset of the broader field of ai safety. we focus on ai failures on account of flaws in conceptualization, design and deployment. other ai safety issues like trade-offs between privacy and security or convenience, bad actors hacking into ai systems to create mayhem or bad actors deploying ai for purposes harmful to humanity and are out of scope of our discussion. we find that ai systems fail on account of omission and commission errors in the design of the ai system, as well as upon failure to develop an appropriate interpretation of input information. moreover, even when there is no significant flaw in the ai software, an ai system may fail because the hardware is incapable of robust performance across environments. finally an ai system is quite likely to fail in situations where, in effect, it is called upon to deliver moral judgments -- a capability ai does not possess. we observe certain trade-offs in measures to mitigate a subset of ai failures and provide some recommendations.

2020-07-17

Ehsan Toreini, Mhairi Aitken, Kovila P. L. Coopamootoo, Karen Elliott, Vladimiro Gonzalez Zelaya, Paolo Missier, Magdalene Ng, Aad Van Moorsel
Abstract: concerns about the societal impact of ai-based services and systems has encouraged governments and other organisations around the world to propose ai policy frameworks to address fairness, accountability, transparency and related topics. to achieve the objectives of these frameworks, the data and software engineers who build machine-learning systems require knowledge about a variety of relevant supporting tools and techniques. in this paper we provide an overview of technologies that support building trustworthy machine learning systems, i.e., systems whose properties justify that people place trust in them. we argue that four categories of system properties are instrumental in achieving the policy objectives, namely fairness, explainability, auditability and safety & security (feas). we discuss how these properties need to be considered across all stages of the machine learning life cycle, from data collection through run-time model inference. as a consequence, we survey in this paper the main technologies with respect to all four of the feas properties, for data-centric as well as model-centric stages of the machine learning system life cycle. we conclude with an identification of open research problems, with a particular focus on the connection between trustworthy machine learning technologies and their implications for individuals and society.

2020-07-16

Mike Zajko
Abstract: in response to calls for greater interdisciplinary involvement from the social sciences and humanities in the development, governance, and study of artificial intelligence systems, this paper presents one sociologist's view on the problem of algorithmic bias and the reproduction of societal bias. discussions of bias in ai cover much of the same conceptual terrain that sociologists studying inequality have long understood using more specific terms and theories. concerns over reproducing societal bias should be informed by an understanding of the ways that inequality is continually reproduced in society -- processes that ai systems are either complicit in, or can be designed to disrupt and counter. the contrast presented here is between conservative and radical approaches to ai, with conservatism referring to dominant tendencies that reproduce and strengthen the status quo, while radical approaches work to disrupt systemic forms of inequality. the limitations of conservative approaches to class, gender, and racial bias are discussed as specific examples, along with the social structures and processes that biases in these areas are linked to. societal issues can no longer be out of scope for ai and machine learning, given the impact of these systems on human lives. this requires engagement with a growing body of critical ai scholarship that goes beyond biased data to analyze structured ways of perpetuating inequality, opening up the possibility for radical alternatives.

2020-07-14

Mohit Kumar Ahuja, Mohamed-Bachir Belaid, Pierre Bernabé, Mathieu Collet, Arnaud Gotlieb, Chhagan Lal, Dusica Marijan, Sagar Sen, Aizaz Sharif, Helge Spieker
Abstract: trustworthiness is a central requirement for the acceptance and success of human-centered artificial intelligence (ai). to deem an ai system as trustworthy, it is crucial to assess its behaviour and characteristics against a gold standard of trustworthy ai, consisting of guidelines, requirements, or only expectations. while ai systems are highly complex, their implementations are still based on software. the software engineering community has a long-established toolbox for the assessment of software systems, especially in the context of software testing. in this paper, we argue for the application of software engineering and testing practices for the assessment of trustworthy ai. we make the connection between the seven key requirements as defined by the european commission's ai high-level expert group and established procedures from software engineering and raise questions for future work.

2020-07-13

Ivan Evtimov, Weidong Cui, Ece Kamar, Emre Kiciman, Tadayoshi Kohno, Jerry Li
Abstract: machine learning (ml) models deployed in many safety- and business-critical systems are vulnerable to exploitation through adversarial examples. a large body of academic research has thoroughly explored the causes of these blind spots, developed sophisticated algorithms for finding them, and proposed a few promising defenses. a vast majority of these works, however, study standalone neural network models. in this work, we build on our experience evaluating the security of a machine learning software product deployed on a large scale to broaden the conversation to include a systems security view of these vulnerabilities. we describe novel challenges to implementing systems security best practices in software with ml components. in addition, we propose a list of short-term mitigation suggestions that practitioners deploying machine learning modules can use to secure their systems. finally, we outline directions for new research into machine learning attacks and defenses that can serve to advance the state of ml systems security.

2020-07-10

Koen Holtman
Abstract: while it is still unclear if agents with artificial general intelligence (agi) could ever be built, we can already use mathematical models to investigate potential safety systems for these agents. we present an agi safety layer that creates a special dedicated input terminal to support the iterative improvement of an agi agent's utility function. the humans who switched on the agent can use this terminal to close any loopholes that are discovered in the utility function's encoding of agent goals and constraints, to direct the agent towards new goals, or to force the agent to switch itself off. an agi agent may develop the emergent incentive to manipulate the above utility function improvement process, for example by deceiving, restraining, or even attacking the humans involved. the safety layer will partially, and sometimes fully, suppress this dangerous incentive. the first part of this paper generalizes earlier work on agi emergency stop buttons. we aim to make the mathematical methods used to construct the layer more accessible, by applying them to an mdp model. we discuss two provable properties of the safety layer, and show ongoing work in mapping it to a causal influence diagram (cid). in the second part, we develop full mathematical proofs, and show that the safety layer creates a type of bureaucratic blindness. we then present the design of a learning agent, a design that wraps the safety layer around either a known machine learning system, or a potential future agi-level learning system. the resulting agent will satisfy the provable safety properties from the moment it is first switched on. finally, we show how this agent can be mapped from its model to a real-life implementation. we review the methodological issues involved in this step, and discuss how these are typically resolved.

2020-07-09

Abhishek Gupta, Erick Galinkin
Abstract: security and ethics are both core to ensuring that a machine learning system can be trusted. in production machine learning, there is generally a hand-off from those who build a model to those who deploy a model. in this hand-off, the engineers responsible for model deployment are often not privy to the details of the model and thus, the potential vulnerabilities associated with its usage, exposure, or compromise. techniques such as model theft, model inversion, or model misuse may not be considered in model deployment, and so it is incumbent upon data scientists and machine learning engineers to understand these potential risks so they can communicate them to the engineers deploying and hosting their models. this is an open problem in the machine learning community and in order to help alleviate this issue, automated systems for validating privacy and security of models need to be developed, which will help to lower the burden of implementing these hand-offs and increasing the ubiquity of their adoption.
Matt Luckcuck, Marie Farrell
Abstract: autonomous robotics systems are inherently safety-critical and have complex safety issues to consider (for example, a safety failure can lead to a safety failure). before they are deployed, these systems of have to show evidence that they adhere to a set of regulator-defined rules for safety and security. formal methods provide robust approaches to proving a system obeys given rules, but formalising (usually natural language) rules can prove difficult. regulations specifically for autonomous systems are still being developed, but the safety rules for a human operator are a good starting point when trying to show that an autonomous system is safe. for applications of autonomous systems like driverless cars and pilotless aircraft, there are clear rules for human operators, which have been formalised and used to prove that an autonomous system obeys some or all of these rules. however, in the space and nuclear sectors applications are more likely to differ, so a set of general safety principles has developed. this allows novel applications to be assessed for their safety, but are difficult to formalise. to improve this situation, we are collaborating with regulators and the community in the space and nuclear sectors to develop guidelines for autonomous and robotic systems that are amenable to robust (formal) verification. these activities also have the benefit of bridging the gaps in knowledge within both the space or nuclear communities and academia.

2020-07-08

Shakir Mohamed, Marie-Therese Png, William Isaac
Abstract: this paper explores the important role of critical science, and in particular of post-colonial and decolonial theories, in understanding and shaping the ongoing advances in artificial intelligence. artificial intelligence (ai) is viewed as amongst the technological advances that will reshape modern societies and their relations. whilst the design and deployment of systems that continually adapt holds the promise of far-reaching positive change, they simultaneously pose significant risks, especially to already vulnerable peoples. values and power are central to this discussion. decolonial theories use historical hindsight to explain patterns of power that shape our intellectual, political, economic, and social world. by embedding a decolonial critical approach within its technical practice, ai communities can develop foresight and tactics that can better align research and technology development with established ethical principles, centring vulnerable peoples who continue to bear the brunt of negative impacts of innovation and scientific progress. we highlight problematic applications that are instances of coloniality, and using a decolonial lens, submit three tactics that can form a decolonial field of artificial intelligence: creating a critical technical practice of ai, seeking reverse tutelage and reverse pedagogies, and the renewal of affective and political communities. the years ahead will usher in a wave of new scientific breakthroughs and technologies driven by ai research, making it incumbent upon ai communities to strengthen the social contract through ethical foresight and the multiplicity of intellectual perspectives available to us; ultimately supporting future technologies that enable greater well-being, with the goal of beneficence and justice for all.
Mingliang Chen, Aria Shahverdi, Sarah Anderson, Se Yong Park, Justin Zhang, Dana Dachman-Soled, Kristin Lauter, Min Wu
Abstract: we propose new tools for policy-makers to use when assessing and correcting fairness and bias in ai algorithms. the three tools are: - a new definition of fairness called "controlled fairness" with respect to choices of protected features and filters. the definition provides a simple test of fairness of an algorithm with respect to a dataset. this notion of fairness is suitable in cases where fairness is prioritized over accuracy, such as in cases where there is no "ground truth" data, only data labeled with past decisions (which may have been biased). - algorithms for retraining a given classifier to achieve "controlled fairness" with respect to a choice of features and filters. two algorithms are presented, implemented and tested. these algorithms require training two different models in two stages. we experiment with combinations of various types of models for the first and second stage and report on which combinations perform best in terms of fairness and accuracy. - algorithms for adjusting model parameters to achieve a notion of fairness called "classification parity". this notion of fairness is suitable in cases where accuracy is prioritized. two algorithms are presented, one which assumes that protected features are accessible to the model during testing, and one which assumes protected features are not accessible during testing. we evaluate our tools on three different publicly available datasets. we find that the tools are useful for understanding various dimensions of bias, and that in practice the algorithms are effective in starkly reducing a given observed bias when tested on new data.

2020-06-29

Elizabeth Reichert, Helen Qiu, Jasmine Bayrooti
Abstract: the censorship of toxic comments is often left to the judgment of imperfect models. perspective api, a creation of google technology incubator jigsaw, is perhaps the most widely used toxicity classifier in industry; the model is employed by several online communities including the new york times to identify and filter out toxic comments with the goal of preserving online safety. unfortunately, google's model tends to unfairly assign higher toxicity scores to comments containing words referring to the identities of commonly targeted groups (e.g., "woman,'' "gay,'' etc.) because these identities are frequently referenced in a disrespectful manner in the training data. as a result, comments generated by marginalized groups referencing their identities are often mistakenly censored. it is important to be cognizant of this unintended bias and strive to mitigate its effects. to address this issue, we have constructed several toxicity classifiers with the intention of reducing unintended bias while maintaining strong classification performance.

2020-06-25

Labhaoise Ni Fhaolain, Andrew Hines
Abstract: is a new regulated profession, such as artificial intelligence (ai) architect who is responsible and accountable for ai outputs necessary to ensure trustworthy ai? ai is becoming all pervasive and is often deployed in everyday technologies, devices and services without our knowledge. there is heightened awareness of ai in recent years which has brought with it fear. this fear is compounded by the inability to point to a trustworthy source of ai, however even the term "trustworthy ai" itself is troublesome. some consider trustworthy ai to be that which complies with relevant laws, while others point to the requirement to comply with ethics and standards (whether in addition to or in isolation of the law). this immediately raises questions of whose ethics and which standards should be applied and whether these are sufficient to produce trustworthy ai in any event.

2020-06-24

John Richards, David Piorkowski, Michael Hind, Stephanie Houde, Aleksandra Mojsilović
Abstract: as ai models and services are used in a growing number of highstakes areas, a consensus is forming around the need for a clearer record of how these models and services are developed to increase trust. several proposals for higher quality and more consistent ai documentation have emerged to address ethical and legal concerns and general social impacts of such systems. however, there is little published work on how to create this documentation. this is the first work to describe a methodology for creating the form of ai documentation we call factsheets. we have used this methodology to create useful factsheets for nearly two dozen models. this paper describes this methodology and shares the insights we have gathered. within each step of the methodology, we describe the issues to consider and the questions to explore with the relevant people in an organization who will be creating and consuming the ai facts in a factsheet. this methodology will accelerate the broader adoption of transparent ai documentation.

2020-06-16

P. Santhanam
Abstract: in the past decade, artificial intelligence (ai) has become a part of our daily lives due to major advances in machine learning (ml) techniques. in spite of an explosive growth in the raw ai technology and in consumer facing applications on the internet, its adoption in business applications has conspicuously lagged behind. for business/mission-critical systems, serious concerns about reliability and maintainability of ai applications remain. due to the statistical nature of the output, software 'defects' are not well defined. consequently, many traditional quality management techniques such as program debugging, static code analysis, functional testing, etc. have to be reevaluated. beyond the correctness of an ai model, many other new quality attributes, such as fairness, robustness, explainability, transparency, etc. become important in delivering an ai system. the purpose of this paper is to present a view of a holistic quality management framework for ml applications based on the current advances and identify new areas of software engineering research to achieve a more trustworthy ai.

2020-06-15

Mirka Snyder Caron, Abhishek Gupta
Abstract: like any technology, ai systems come with inherent risks and potential benefits. it comes with potential disruption of established norms and methods of work, societal impacts and externalities. one may think of the adoption of technology as a form of social contract, which may evolve or fluctuate in time, scale, and impact. it is important to keep in mind that for ai, meeting the expectations of this social contract is critical, because recklessly driving the adoption and implementation of unsafe, irresponsible, or unethical ai systems may trigger serious backlash against industry and academia involved which could take decades to resolve, if not actually seriously harm society. for the purpose of this paper, we consider that a social contract arises when there is sufficient consensus within society to adopt and implement this new technology. as such, to enable a social contract to arise for the adoption and implementation of ai, developing: 1) a socially accepted purpose, through 2) a safe and responsible method, with 3) a socially aware level of risk involved, for 4) a socially beneficial outcome, is key.

2020-06-13

Kyle Dent
Abstract: use of artificial intelligence is growing and expanding into applications that impact people's lives. people trust their technology without really understanding it or its limitations. there is the potential for harm and we are already seeing examples of that in the world. ai researchers have an obligation to consider the impact of intelligent applications they work on. while the ethics of ai is not clear-cut, there are guidelines we can consider to minimize the harm we might introduce.

2020-06-11

Abhishek Gupta, Camylle Lanteigne, Sara Kingsley
Abstract: in a world increasingly dominated by ai applications, an understudied aspect is the carbon and social footprint of these power-hungry algorithms that require copious computation and a trove of data for training and prediction. while profitable in the short-term, these practices are unsustainable and socially extractive from both a data-use and energy-use perspective. this work proposes an esg-inspired framework combining socio-technical measures to build eco-socially responsible ai systems. the framework has four pillars: compute-efficient machine learning, federated learning, data sovereignty, and a leedesque certificate. compute-efficient machine learning is the use of compressed network architectures that show marginal decreases in accuracy. federated learning augments the first pillar's impact through the use of techniques that distribute computational loads across idle capacity on devices. this is paired with the third pillar of data sovereignty to ensure the privacy of user data via techniques like use-based privacy and differential privacy. the final pillar ties all these factors together and certifies products and services in a standardized manner on their environmental and social impacts, allowing consumers to align their purchase with their values.

2020-06-09

Travis Lacroix, Aydin Mohseni
Abstract: policy and guideline proposals for ethical artificial-intelligence research have proliferated in recent years. these are supposed to guide the socially-responsible development of ai for the common good. however, there typically exist incentives for non-cooperation (i.e., non-adherence to such policies and guidelines); and, these proposals often lack effective mechanisms to enforce their own normative claims. the situation just described constitutes a social dilemma; namely, a situation where no one has an individual incentive to cooperate, though mutual cooperation would lead to the best outcome for all involved. in this paper, we use stochastic evolutionary game dynamics to model this social dilemma in the context of the ethical development of artificial intelligence. this formalism allows us to isolate variables that may be intervened upon, thus providing actionable suggestions for increased cooperation amongst numerous stakeholders in ai. our results show how stochastic effects can help make cooperation viable in such a scenario. they suggest that coordination for a common good should be attempted in smaller groups in which the cost for cooperation is low, and the perceived risk of failure is high. this provides insight into the conditions under which we should expect such ethics proposals to be successful with regard to their scope, scale, and content.

2020-06-02

Reid Mcilroy-Young, Siddhartha Sen, Jon Kleinberg, Ashton Anderson
Abstract: as artificial intelligence becomes increasingly intelligent---in some cases, achieving superhuman performance---there is growing potential for humans to learn from and collaborate with algorithms. however, the ways in which ai systems approach problems are often different from the ways people do, and thus may be uninterpretable and hard to learn from. a crucial step in bridging this gap between human and artificial intelligence is modeling the granular actions that constitute human behavior, rather than simply matching aggregate human performance. we pursue this goal in a model system with a long history in artificial intelligence: chess. the aggregate performance of a chess player unfolds as they make decisions over the course of a game. the hundreds of millions of games played online by players at every skill level form a rich source of data in which these decisions, and their exact context, are recorded in minute detail. applying existing chess engines to this data, including an open-source implementation of alphazero, we find that they do not predict human moves well. we develop and introduce maia, a customized version of alpha-zero trained on human chess games, that predicts human moves at a much higher accuracy than existing engines, and can achieve maximum accuracy when predicting decisions made by players at a specific skill level in a tuneable way. for a dual task of predicting whether a human will make a large mistake on the next move, we develop a deep neural network that significantly outperforms competitive baselines. taken together, our results suggest that there is substantial promise in designing artificial intelligence systems with human collaboration in mind by first accurately modeling granular human decision-making.

2020-06-01

John Pavlopoulos, Jeffrey Sorensen, Lucas Dixon, Nithum Thain, Ion Androutsopoulos
Abstract: moderation is crucial to promoting healthy on-line discussions. although several `toxicity' detection datasets and models have been published, most of them ignore the context of the posts, implicitly assuming that comments maybe judged independently. we investigate this assumption by focusing on two questions: (a) does context affect the human judgement, and (b) does conditioning on context improve performance of toxicity detection systems? we experiment with wikipedia conversations, limiting the notion of context to the previous post in the thread and the discussion title. we find that context can both amplify or mitigate the perceived toxicity of posts. moreover, a small but significant subset of manually labeled posts (5% in one of our experiments) end up having the opposite toxicity labels if the annotators are not provided with context. surprisingly, we also find no evidence that context actually improves the performance of toxicity classifiers, having tried a range of classifiers and mechanisms to make them context aware. this points to the need for larger datasets of comments annotated in context. we make our code and data publicly available.
Alon Jacovi, Yoav Goldberg
Abstract: we find that the requirement of model interpretations to be faithful is vague and incomplete. with interpretation by textual highlights as a case-study, we present several failure cases. borrowing concepts from social science, we identify that the problem is a misalignment between the causal chain of decisions (causal attribution) and the attribution of human behavior to the interpretation (social attribution). we re-formulate faithfulness as an accurate attribution of causality to the model, and introduce the concept of aligned faithfulness: faithful causal chains that are aligned with their expected social behavior. the two steps of causal attribution and social attribution together complete the process of explaining behavior. with this formalization, we characterize various failures of misaligned faithful highlight interpretations, and propose an alternative causal chain to remedy the issues. finally, we implement highlight explanations of the proposed causal format using contrastive explanations.

2020-05-29

Andrew Critch, David Krueger
Abstract: framed in positive terms, this report examines how technical ai research might be steered in a manner that is more attentive to humanity's long-term prospects for survival as a species. in negative terms, we ask what existential risks humanity might face from ai development in the next century, and by what principles contemporary technical research might be directed to address those risks. a key property of hypothetical ai technologies is introduced, called \emph{prepotence}, which is useful for delineating a variety of potential existential risks from artificial intelligence, even as ai paradigms might shift. a set of \auxref{dirtot} contemporary research \directions are then examined for their potential benefit to existential safety. each research direction is explained with a scenario-driven motivation, and examples of existing work from which to build. the research directions present their own risks and benefits to society that could occur at various scales of impact, and in particular are not guaranteed to benefit existential safety if major developments in them are deployed without adequate forethought and oversight. as such, each direction is accompanied by a consideration of potentially negative side effects.

2020-05-28

Su Lin Blodgett, Solon Barocas, Hal Daumé, Hanna Wallach
Abstract: we survey 146 papers analyzing "bias" in nlp systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing "bias" is an inherently normative process. we further find that these papers' proposed quantitative techniques for measuring or mitigating "bias" are poorly matched to their motivations and do not engage with the relevant literature outside of nlp. based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing "bias" in nlp systems. these recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of "bias"---i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements---and to center work around the lived experiences of members of communities affected by nlp systems, while interrogating and reimagining the power relations between technologists and such communities.