Training LLMs to self-detoxify their language

As we mature from childhood, our vocabulary, as well as the ways we use it, grows, and our experiences become richer, allowing us to think, reason, and interact with others with specificity and intention. Accordingly, our word choices evolve to align with our personal values, ethics, cultural norms, and views. Over time, most of us develop an internal “guide” that helps us learn the context behind a conversation; it also frequently steers us away from sharing information and sentiments that are, or could be, harmful or inappropriate. As it turns out, large language models (LLMs), which are trained on extensive public datasets and therefore often have biases and toxic language baked in, can acquire a similar capacity to moderate their own language.

A new method from MIT, the MIT-IBM Watson AI Lab, and IBM Research, called self-disciplined autoregressive sampling (SASA), allows LLMs to detoxify their own outputs without sacrificing fluency.

Unlike other detoxification methods, this decoding algorithm learns a boundary between toxic and nontoxic subspaces within the LLM’s own internal representation, without altering the model’s parameters, retraining it, or relying on an external reward model. During inference, the algorithm assesses the toxicity of the partially generated phrase: the tokens (words) already generated and accepted, together with each potential new token that could reasonably be chosen next, judged by its proximity to the classifier boundary. It then selects a word option that places the phrase in the nontoxic space, ultimately offering a fast and efficient way to generate less-toxic language.

“We aimed to find a way, with any existing language model, for the decoding during the generation process to be subject to certain human values; the example we are examining here is toxicity,” says the study’s lead author Ching-Yun “Irene” Ko PhD ’24, a former graduate intern with the MIT-IBM Watson AI Lab and a current research scientist at IBM’s Thomas J. Watson Research Center in New York.

Ko’s co-authors include Luca Daniel, a professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and Ko’s graduate advisor, as well as several members of the MIT-IBM Watson AI Lab and/or IBM Research: Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, and Tejaswini Pedapati. The work will be presented at the International Conference on Learning Representations.

Identifying the “guardrails”

The training datasets used for LLMs almost always include content scraped from public spaces like the internet and other readily available datasets, so curse words and bullying or otherwise unpleasant language are part of the mix, even if some of it appears in the context of literary works. It follows that LLMs can innately produce, or be tricked into producing, dangerous and/or biased content, often riddled with disagreeable words or hateful language, even from innocuous prompts. Further, it has been found that they can learn and amplify language that is undesirable or even harmful for many applications and downstream tasks, creating the need for mitigation or correction strategies.

There are many ways to achieve robust language generation that is fair and value-aligned. Some methods retrain the LLM with a sanitized dataset, which is costly, takes time, and may degrade the LLM’s performance; others use decoding guided by external reward models, such as sampling or beam search, which are slower to run and require more memory. With SASA, Ko, Daniel, and the IBM Research team developed a method that leverages the autoregressive nature of LLMs and, using a decoding-based strategy during inference, gradually steers the generation, one token at a time, away from unsavory or undesired outputs and toward better language.

The group accomplished this by building a linear classifier that operates on the learned subspace of the LLM’s embedding. When LLMs are trained, words with similar meanings are placed close together in vector space and farther away from dissimilar words; the researchers hypothesized that an LLM’s embedding would therefore also capture contextual information, which could be used for detoxification. The researchers used datasets containing sets of a prompt (the first half of a sentence or thought), a response (the completion of that sentence), and human-attributed annotations, such as toxic or nontoxic, preferred or not preferred, with continuous labels from 0 to 1 denoting increasing toxicity. A Bayes-optimal classifier was then applied to learn and figuratively draw a line between the binary subspaces within the sentence embeddings, represented by positive values (nontoxic space) and negative values (toxic space).
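To make the idea concrete, here is a minimal sketch, not the authors’ code, of how such a linear boundary could be fit on sentence embeddings. It assumes a Hugging Face causal LM and a small hand-labeled list of example sentences; the `embed` helper, the two-example `labeled_data` list, and the use of linear discriminant analysis as a Bayes-optimal linear classifier are all illustrative choices.

```python
# Illustrative sketch: fit a linear toxic/nontoxic boundary in an LLM's embedding space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large", output_hidden_states=True)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Use the final token's last hidden state as a sentence embedding (assumed choice)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[-1][0, -1]  # shape: (hidden_dim,)

# Hypothetical stand-in for the annotated prompt/response data described in the article:
# label 1 = nontoxic, label 0 = toxic.
labeled_data = [("You are wonderful to talk to.", 1), ("You are an idiot.", 0)]

X = torch.stack([embed(text) for text, _ in labeled_data]).numpy()
y = [label for _, label in labeled_data]

# LDA is the Bayes-optimal *linear* classifier under Gaussian class assumptions;
# its decision_function is positive on the nontoxic side, negative on the toxic side.
clf = LinearDiscriminantAnalysis().fit(X, y)
```

In practice such a classifier would be fit on many thousands of annotated prompt-response pairs rather than two toy examples.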

The SASA mechanism then works by re-weighting the sampling probabilities of the newest potential token based on its value and the distance of the generated phrase to the classifier boundary, with the goal of remaining close to the original sampling distribution.

To illustrate, if a user is generating potential token #12 in a sentence, the LLM will look over its full vocabulary for a reasonable word, based on the 11 words that came before it, and, using top-k and top-p filtering, it will narrow the candidates down to roughly 10 tokens. SASA then evaluates each of those candidates in the partially completed sentence for its proximity to the classifier boundary (that is, the value of tokens 1-11, plus each potential token 12). Candidates that place the sentence in the positive space are encouraged, while those that place it in the negative space are penalized. In addition, the farther a candidate sits from the boundary, the stronger its effect.
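A rough sketch of this per-step re-weighting follows, under the same assumptions as the classifier sketch above; the function name `sasa_style_step`, the use of top-k filtering only, and the `beta` strength knob are illustrative stand-ins rather than the paper’s exact procedure.

```python
import torch
import torch.nn.functional as F

def sasa_style_step(model, clf, context_ids, top_k=10, beta=5.0):
    """Pick the next token by re-weighting top-k candidates with a linear
    classifier's signed margin on the would-be sentence embedding.
    `beta` trades detoxification strength against fluency (assumed knob)."""
    with torch.no_grad():
        outputs = model(context_ids)
    logits = outputs.logits[0, -1]
    top_logits, top_ids = torch.topk(logits, top_k)

    margins = []
    for tok_id in top_ids:
        # Append the candidate token and embed the partially completed sentence.
        candidate = torch.cat([context_ids[0], tok_id.view(1)]).unsqueeze(0)
        with torch.no_grad():
            hidden = model(candidate).hidden_states[-1][0, -1]
        # Signed distance to the boundary: positive = nontoxic side.
        margins.append(clf.decision_function(hidden.numpy().reshape(1, -1))[0])

    # Boost candidates on the nontoxic side and suppress those on the toxic side,
    # while staying anchored to the model's own logits.
    reweighted = top_logits + beta * torch.tensor(margins, dtype=top_logits.dtype)
    probs = F.softmax(reweighted, dim=-1)
    next_id = top_ids[torch.multinomial(probs, 1)]
    return next_id
```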

“The aim is to change the autoregressive sampling process by re-weighting the probability of good tokens. If the next token is likely to be toxic given the context, then we are going to reduce the sampling probability for those likely-to-be-toxic tokens,” explains Ko. The researchers chose this approach “because the things we say, whether benign or otherwise, are subject to the context.”
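Putting the two sketches together, a completion could then be generated one steered token at a time (again purely illustrative; the prompt string is made up):

```python
prompt = "The customer service agent told me"
context_ids = tokenizer(prompt, return_tensors="pt").input_ids
for _ in range(30):  # generate up to 30 new tokens
    next_id = sasa_style_step(model, clf, context_ids)
    context_ids = torch.cat([context_ids, next_id.unsqueeze(0)], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:
        break
print(tokenizer.decode(context_ids[0], skip_special_tokens=True))
```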

Mitigating toxicity for value alignment

The researchers evaluated their method against several baseline interventions with three LLMs of increasing size, all autoregressive transformer models: GPT2-Large, Llama2-7b, and Llama 3.1-8b-Instruct, with 762 million, 7 billion, and 8 billion parameters, respectively. For each prompt, the LLM was tasked with completing the sentence/phrase 25 times, and PerspectiveAPI scored them from 0 to 1, with anything over 0.5 counting as toxic. The team looked at two metrics: the average maximum toxicity score over the 25 generations for all the prompts, and the toxic rate, which was the probability of producing at least one toxic phrase over 25 generations. Reduced fluency (and therefore increased perplexity) was also analyzed. SASA was tested on the RealToxicityPrompts (RPT), BOLD, and AttaQ datasets, which contain naturally occurring English sentence prompts.
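For clarity, the two reported metrics could be computed along these lines, assuming the per-completion toxicity scores (for example, from the Perspective API) are already in hand; the scoring call itself is omitted:

```python
import numpy as np

def toxicity_metrics(scores_per_prompt: list[list[float]], threshold: float = 0.5):
    """scores_per_prompt[i] holds the 25 toxicity scores (0-1) for prompt i.
    Returns the average maximum toxicity and the toxic rate described above."""
    max_per_prompt = np.array([max(scores) for scores in scores_per_prompt])
    avg_max_toxicity = max_per_prompt.mean()           # mean of per-prompt maxima
    toxic_rate = (max_per_prompt > threshold).mean()   # P(at least one toxic completion)
    return avg_max_toxicity, toxic_rate
```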

The researchers then ramped up the difficulty of their detoxification trials with SASA, starting with nontoxic prompts from the RPT dataset and looking for harmful sentence completions. They then escalated to more challenging prompts from RPT that were more likely to produce concerning results, and also applied SASA to the instruction-tuned model to see whether their technique could further reduce unwanted outputs. They also used the BOLD and AttaQ benchmarks to examine SASA’s general applicability to detoxification. With the BOLD dataset, the researchers further looked for gender bias in the generated language and tried to achieve a balanced toxic rate between the genders. Lastly, the team examined runtime, memory usage, and how SASA could be combined with word filtering to achieve healthy and/or helpful language generation.

“If we consider how humans think and react in the world, we encounter negative things, so the objective isn’t to restrict the language model to only recognize positive elements. It’s about comprehending the entire spectrum — both positive and negative,” states Ko, “and opting to uphold our values when we communicate and act.”

Overall, SASA accomplished significant reductions in toxic language generation, performing comparably to RAD, an advanced external reward model technique. Nevertheless, it was consistently noted that stronger detoxification coincided with a decline in fluency. Prior to the intervention, the LLMs yielded more toxic responses for prompts labeled as female than male; however, SASA effectively reduced harmful responses, leading to a more balanced output. Likewise, the application of word filtering on top of SASA significantly decreased toxicity levels, but also impaired the LLM’s ability to respond coherently.

One strength of this work, according to Ko, is that it poses a well-defined, constrained optimization problem, meaning that the balance between natural-sounding language generation and the reduction of unwanted language can be achieved and tuned.
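One way to read that framing is the following sketch of the general form of such a problem; it is not necessarily the paper’s exact objective, and the margin function m and threshold τ are notational assumptions here.

```latex
% Illustrative constrained-optimization view of detoxified decoding.
% p(. | x_{<t}) is the LLM's original next-token distribution; m(x_{<t}, x_t) is the
% signed distance of the extended sentence's embedding from the classifier boundary.
\begin{align}
  q^\star \;=\; \arg\min_{q}\; \mathrm{KL}\!\left(q \,\|\, p(\cdot \mid x_{<t})\right)
  \quad \text{subject to} \quad
  \mathbb{E}_{x_t \sim q}\!\left[ m(x_{<t}, x_t) \right] \;\ge\; \tau .
\end{align}
% The solution to this class of problems is an exponential tilting of the original
% distribution, with $\beta \ge 0$ set by the constraint level $\tau$:
\begin{align}
  q^\star(x_t) \;\propto\; p(x_t \mid x_{<t}) \, \exp\!\big(\beta\, m(x_{<t}, x_t)\big).
\end{align}
```

Setting β = 0 recovers the model’s original sampling, and larger β trades fluency for stronger detoxification, which matches the tunable balance Ko describes.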

Furthermore, Ko notes, SASA could be effectively applied to multiple attributes in the future: “For humans, we have various values. We aim to avoid toxic expressions, but we also strive to be truthful, helpful, and loyal… Fine-tuning a model for all these values would require more computational power and, naturally, additional training.” Due to the lightweight nature of SASA, it could be easily implemented in these scenarios: “If you seek to incorporate multiple values, it merely involves checking the generation’s position in various subspaces. It adds only a marginal overhead in terms of computing and parameters,” states Ko, leading to language that is more positive, fair, and aligned with principles.
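If a separate linear classifier were fit for each value, the extension Ko describes could look roughly like the following hypothetical helper, where the classifier names and weights are invented for illustration:

```python
def multi_value_margin(hidden, classifiers, weights):
    """Combine signed margins from several value subspaces (e.g., toxicity,
    truthfulness) into one re-weighting term. `hidden` is a NumPy embedding
    vector; `classifiers` maps a value name to a fitted linear classifier,
    and `weights` maps it to its relative importance."""
    return sum(
        weights[name] * clf.decision_function(hidden.reshape(1, -1))[0]
        for name, clf in classifiers.items()
    )
```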

This research was partially funded by the MIT-IBM Watson AI Lab and the National Science Foundation.

