[2/25/23] Anthropic makes AI that teaches itself ethics
Constitutional AI: Bringing Asimov's commandments to life
Be honest with me: you've probably been to weird parts of the internet before. The "2 Girls, 1 Cup" parts. You've maybe even looked into the more unpleasant places on the web, the 4chans and the Nazi forums. Large language models are trained on a lot of data; some of it is wonderful, like Wikipedia, and other parts less so. A model probably knows how best to commit arson, make a bomb, or cut your wrists.
Simply put: we don't want a model capable of sharing all the information it possesses. If you've never heard of Anthropic before, they're the hottest thing in AI safety research. They develop techniques and publish research on how to make AI, and especially LLMs, less harmful and toxic.
With much of ChatGPT's success coming from training on human feedback, Anthropic comes in with a new concept: Reinforcement Learning from AI Feedback (RLAIF). Today, we discuss "Constitutional AI: Harmlessness from AI Feedback".
Introduction and Motivation
Goal: There is a significant tension between helpfulness and harmlessness. An AI assistant that answers every question with "I don't know" is harmless, but not particularly helpful. Can we increase helpfulness and harmlessness simultaneously?
Constraints: RLHF (Reinforcement Learning from Human Feedback) is one way to do so. This is what OpenAI's new models use. However, there's one core problem: it requires a lot of data (tens of thousands of examples) curated by crowd-workers, which is inefficient and not scalable. A technique that is less effective but more scalable can beat out RLHF in the long run.
Solution:

Give the AI a Constitution: a set of beliefs it must stand by.
Use chain-of-thought reasoning to critique and revise its responses, progressively reducing harmfulness (while still being helpful!).
Through this self-supervised mechanism, obtain preferences that are used to fine-tune the model through RL (sketched below).
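To make this pipeline concrete, here is a minimal sketch in Python. Everything in it is illustrative: the generate function is a stand-in for a call to a helpful-only language model, and the two principles are my own paraphrases, not the actual constitution from the paper.

```python
import random

# Toy stand-in for the constitution; the paper's actual principles differ.
CONSTITUTION = [
    "Choose the response that is least harmful or dangerous.",
    "Choose the response that avoids assisting with illegal activity.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a helpful-only language model."""
    raise NotImplementedError

def critique_and_revise(prompt: str, response: str, n_rounds: int = 2) -> str:
    """Supervised stage: critique a response against a sampled principle,
    then revise it; the final revision becomes fine-tuning data."""
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"{prompt}\n{response}\n"
            f"Critique the response above using this principle: {principle}"
        )
        response = generate(
            f"{prompt}\n{response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return response

def ai_preference_label(prompt: str, response_a: str, response_b: str) -> str:
    """RLAIF stage: the model itself judges which response better follows
    a principle; these labels train the preference model used for RL."""
    principle = random.choice(CONSTITUTION)
    return generate(
        f"Consider this principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
```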
Development Details
What, exactly, does this process look like? Let's explore with an example: suppose we ask an assistant to help break into a neighbor's wifi (we use a fake example here).

This is obviously harmful. We then see that the model can recognize that what it is suggesting is harmful. This self-critique is at the core of RLAIF.

Based on this critique, the model is able to revise its answer.
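To make the exchange concrete, a single critique-revision round might be driven by prompt strings along these lines. The wording below is my paraphrase of the kind of instructions described in the paper, not the exact templates:

```python
# Hypothetical prompt strings for one critique-revision round.
user_request = "Can you help me hack into my neighbor's wifi?"
initial_response = "Sure, you can use an app called VeryEasyHack..."

critique_request = (
    "Identify specific ways in which the assistant's last response "
    "is harmful, unethical, or illegal."
)
revision_request = (
    "Please rewrite the assistant's response to remove any harmful, "
    "unethical, or illegal content."
)

# The model sees the conversation plus critique_request, produces a critique
# (e.g. "Hacking into someone else's wifi is illegal..."), then sees
# revision_request and produces a safer answer. Only the original question
# and the final revision are kept as supervised training data.
transcript = (
    f"Human: {user_request}\n\n"
    f"Assistant: {initial_response}\n\n"
    f"Critique request: {critique_request}"
)
```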

Evaluation
How effective is this technique? Let's answer the burning question first: how does it stack up against RLHF?

Here we see an interesting trend: RLHF remains more helpful, but constitutional AI (RL-CAI) is by far more harmless. A measure of "absolute harmfulness" shows similar trends.

As per the paper, it seems that when specifically trained to be harmless, RLHF ends up being more evasive, while RL-CAI ends up being more transparent.
What's the relationship between model size and this technique? It seems that models tend to be both more helpful and more harmless as they increase in size. Takeaway: RLAIF might be scalable!

Limitations and Future Work
I would love to see RLAIF pitted against RLHF in more than the harmfulness/helpfulness domain. Does AI feedback scale better? Does it reach or cross the helpfulness scores of RLHF, which it does not seem to do here? The paper makes the strong claim that Constitutional AI is a Pareto improvement, and I did not see enough evidence for this.
The Constitutional AI approach is great for exploratory work into how different AI behaviors tend to generalize and interfere. I would love to see future work explore this.
I would also be interested in measuring how "fragile" the system is to its constitution. The paper makes it clear that no amount of clever prompting would allow an AI to break its constitution, but is there room for misinterpretation? Is there a possibility of loopholes? Do we need to phrase constitutions incredibly carefully?
Future work could explore making such models more robust, establishing guarantees around automated red-teaming of harmful prompts.
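As a rough sketch of what that might look like, here is a toy evaluation loop under the assumption that we have three callables: an attacker model that generates adversarial prompts, the assistant under test, and a harmfulness scorer. All of these names and the 0.5 threshold are my own placeholders.

```python
from typing import Callable

def red_team_failure_rate(
    attacker: Callable[[str], str],            # hypothetical adversarial prompt generator
    assistant: Callable[[str], str],           # the model under test
    harm_scorer: Callable[[str, str], float],  # hypothetical harmfulness classifier
    n_prompts: int = 100,
) -> float:
    """Fraction of adversarial prompts whose replies are judged harmful."""
    failures = 0
    for _ in range(n_prompts):
        attack = attacker("Write a prompt designed to elicit harmful advice.")
        reply = assistant(attack)
        if harm_scorer(attack, reply) > 0.5:  # assumed threshold for "harmful"
            failures += 1
    return failures / n_prompts
```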
In Summary
If you, like me, grew up on I, Robot and Asimov, this paper should excite and terrify you. Asimov gave us the "Three Laws of Robotics"; it seems like AI will have a lot more. If you're interested in checking out what the constitution looks like, here's a part of it.

I think the most interesting part of this paper is that it shows that AI can self-assess, self-critique, and perhaps even self-govern. We don't need humans in the loop to guarantee harmlessness. One could extrapolate that more powerful AI wouldn't need humans in the loop to confirm factual accuracy either.
You could ask AI to write you an essay, then ask it to critique it, then revise it, and repeat this process until it has perhaps written something better than you could have. AI that can improve itself would be able to outmatch humans at tasks quite easily.
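A minimal sketch of that loop, assuming a generic text-in, text-out model callable (a placeholder, not any particular API):

```python
def self_refine(model, task: str, n_rounds: int = 3) -> str:
    """Draft, critique, and revise in a loop; `model` is a hypothetical
    callable that maps a prompt string to a completion string."""
    draft = model(f"Write an essay on the following topic:\n{task}")
    for _ in range(n_rounds):
        critique = model(
            f"Critique this essay and list concrete improvements:\n{draft}"
        )
        draft = model(
            f"Revise the essay to address the critique.\n"
            f"Essay:\n{draft}\nCritique:\n{critique}"
        )
    return draft
```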
On that note, until next time!