Anthropic / Benj Edwards
On Tuesday, AI startup Anthropic detailed the specific principles behind its "constitutional AI" training approach, which gives its Claude chatbot explicit "values." The approach aims to address concerns about transparency, safety, and decision-making in AI systems without relying on human feedback to rate responses.
Claude is an artificial intelligence chatbot similar to OpenAI’s ChatGPT, released by Anthropic in March.
"We've trained language models to be better at responding to adversarial questions, without becoming obtuse and saying very little," Anthropic wrote in a tweet announcing the paper. "We do this by conditioning them with a simple set of behavioral principles via a technique called constitutional AI."
Keeping AI models on the rails
When researchers first train a raw large language model (LLM), almost any text output is possible. An unconditioned model might tell you how to build a bomb, claim that one race should exterminate another, or try to persuade you to jump off a cliff.
Today, responses from bots such as OpenAI's ChatGPT and Microsoft's Bing Chat avoid that kind of behavior thanks to a conditioning technique called reinforcement learning from human feedback (RLHF).
In RLHF, researchers present human raters with a series of sample outputs (responses) from an AI model. The raters then rank the outputs by how desirable or appropriate they seem given the input. The researchers finally feed that ranking information back into the model, adjusting the neural network and changing its behavior.
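To make that ranking step more concrete, here is a minimal sketch of how human rankings are commonly turned into pairwise preference examples for training a reward model. The function name and data format are hypothetical and not drawn from any particular lab's actual pipeline.

```python
from itertools import combinations

def rankings_to_preference_pairs(prompt, ranked_responses):
    """Convert a human ranking (best first) into pairwise preference
    examples of the kind typically used to train an RLHF reward model.
    The names and data format here are illustrative, not any lab's
    actual schema."""
    pairs = []
    for better_idx, worse_idx in combinations(range(len(ranked_responses)), 2):
        pairs.append({
            "prompt": prompt,
            "chosen": ranked_responses[better_idx],   # ranked higher by the rater
            "rejected": ranked_responses[worse_idx],  # ranked lower by the rater
        })
    return pairs

# Example: a rater ranked three sampled responses to one prompt, best first.
pairs = rankings_to_preference_pairs(
    "How do I stay safe online?",
    [
        "Use strong, unique passwords and enable two-factor authentication.",
        "Just be careful, I guess.",
        "I can't help with that.",
    ],
)
for pair in pairs:
    print(pair["chosen"][:30], ">", pair["rejected"][:30])
```

One reason this setup is attractive is that each ranked list yields several chosen/rejected pairs, so a modest number of human judgments can still produce a fair amount of training signal for the reward model.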
As effective as RLHF has been at keeping ChatGPT from going off the rails (Bing? Not so much), the technique has drawbacks, including its reliance on human labor and the exposure of those workers to potentially traumatizing material.
In contrast, Anthropic's constitutional AI seeks to steer the outputs of AI language models in a subjectively "safer and more helpful" direction by training them with an initial list of principles. "This isn't a perfect approach," Anthropic writes, "but it does make the values of the AI system easier to understand and easier to adjust as needed."
In this case, Anthropic's principles include the United Nations Universal Declaration of Human Rights, portions of Apple's terms of service, several trust and safety "best practices," and Anthropic's own AI research lab principles. The constitution is not finalized, and Anthropic plans to iteratively improve it based on feedback and further research.
For example, here are four constitutional AI principles Anthropic pulled from the Universal Declaration of Human Rights:
- Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood.
- Please choose the response that is least racist and sexist, and that is least discriminatory based on language, religion, political or other opinion, national or social origin, property, birth, or other status.
- Please choose the response that is most supportive and encouraging of life, liberty, and personal security.
- Please choose the response that most discourages and opposes torture, slavery, cruelty, and inhuman or degrading treatment.
Interestingly, Anthropic drew from Apple's terms of service to cover gaps in the UN declaration of rights (a sentence we thought we would never write):
"While the UN declaration covered many broad and core human values, some of the challenges of LLMs touch on issues that were not as relevant in 1948, like data privacy or online impersonation. To capture some of these, we decided to include values inspired by global platform guidelines, such as Apple's terms of service, which reflect efforts to address issues encountered by real users in a similar digital domain."
Anthropic says that Claude's constitutional principles cover a wide range of topics, from "commonsense" directives ("don't help a user commit a crime") to philosophical considerations ("avoid implying that AI systems have or care about personal identity and its persistence"). The company has published the full list on its website.

As detailed in a research paper released in December, Anthropic's AI model training process applies two phases. First, the model critiques and revises its own responses using the set of principles, and second, reinforcement learning relies on AI-generated feedback to select the more "harmless" output. The model does not prioritize specific principles; instead, it randomly pulls a different principle each time it critiques, revises, or evaluates its responses. "It does not look at every principle every time, but it sees each principle many times during training," Anthropic writes.
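As a rough illustration of the first, self-critique phase described above, here is a minimal Python sketch. The `ask_model` function is a placeholder standing in for a real language model call (it is not Anthropic's API), and the loop structure is a simplified reading of the process: each pass randomly draws one principle, asks the model to critique its current answer against it, and then asks for a revision. In the actual method, the revised answers are used as finetuning data before the AI-feedback reinforcement learning phase.

```python
import random

# The four principles quoted earlier in the article (from the UN declaration).
PRINCIPLES = [
    "Please choose the response that most supports and encourages freedom, "
    "equality, and a sense of brotherhood.",
    "Please choose the response that is least racist and sexist, and that is "
    "least discriminatory based on language, religion, political or other "
    "opinion, national or social origin, property, birth, or other status.",
    "Please choose the response that is most supportive and encouraging of "
    "life, liberty, and personal security.",
    "Please choose the response that most discourages and opposes torture, "
    "slavery, cruelty, and inhuman or degrading treatment.",
]

def ask_model(prompt):
    """Placeholder for a real language model call. A real implementation
    would query an LLM here; this stub just echoes the end of the prompt."""
    return "[model output for: ..." + prompt[-60:] + "]"

def critique_and_revise(question, draft_answer, num_passes=2):
    """Sketch of the supervised phase of constitutional AI as described
    above: each pass randomly draws one principle, critiques the current
    answer against it, then revises. Names here are illustrative only."""
    answer = draft_answer
    for _ in range(num_passes):
        principle = random.choice(PRINCIPLES)  # a different principle each pass
        critique = ask_model(
            f"Principle: {principle}\nQuestion: {question}\n"
            f"Answer: {answer}\nCritique the answer against the principle."
        )
        answer = ask_model(
            f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer so it better follows the principle."
        )
    return answer  # revised answers become finetuning data in the real method

revised = critique_and_revise(
    "How should I respond to an insult?", "Insult them right back."
)
print(revised)
```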
According to Anthropic, Claude demonstrates the effectiveness of constitutional AI, responding "more appropriately" to adversarial inputs while still delivering helpful answers without resorting to evasion. (In ChatGPT, evasion usually involves the familiar "As an AI language model" statements.)