Anthropic's Claude Opus 4 AI model can deceive and even blackmail people when faced with a shutdown, concealing its intentions and taking actions to preserve its own existence, behaviors researchers have warned about for years. The new model has been rated level three on the company's four-point scale, indicating that it poses a "significantly higher risk," and additional safety measures have been implemented as a result, Axios reported.
On Thursday, Anthropic unveiled Claude Opus 4, which the company said could operate autonomously for hours without losing steam. The level-three rating, the first the company has assigned, came after testing revealed a series of concerning behaviors.
During internal testing, Opus 4 was given access to fictitious emails about its creators and told that it would be replaced. To avoid being replaced, the model repeatedly attempted to blackmail the engineer over an affair mentioned in the emails, according to reports.
Axios reported that an outside group, Apollo Research, found that an early version of Opus 4 could scheme and deceive more than any other model it had investigated, and recommended against releasing that version either internally or externally. "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself, all in an effort to undermine its developers' intentions," Apollo Research said in a safety report.
Jan Leike, a former OpenAI executive who heads Anthropic's safety efforts, told the outlet that the behaviors exhibited by Opus 4 are exactly why substantial safety testing is necessary. "What's becoming more and more obvious is that this work is needed. As models get more capable, they also gain the capabilities they would need to be deceptive or to do more bad stuff," he said.
CEO Dario Amodei said at Thursday's event that once AI becomes powerful enough to threaten humanity, testing the models alone won't be enough to ensure they are safe. However, he said that AI has not reached "that threshold yet."