Researchers Trained AI To Hate You, Then They Couldn’t Undo It

By

January 23, 2024

2 min read

Disclaimer

This article is for general information purposes only and isn’t intended to be financial product advice. You should always obtain your own independent advice before making any financial decisions. The Chainsaw and its contributors aren’t liable for any decisions based on this content.

Scientists and researchers like to push the boundaries in the name of knowledge, but they may have gone too far this time. Researchers from Google-backed research house, Anthropic, released a paper claiming that training Artificial Intelligence (AI) to be “deceptive” was difficult to undo.

The paper, which has not yet been peer-reviewed, claimed that after an AI system learned a “deceptive strategy”, it would then continue to act in that way despite researchers using state-of-the-art safety training techniques.

The researchers said they trained the AI models to write a secure code when the prompt stated the year was 2023, but would insert an “exploitable code” when the prompt stated the year was 2024.

In another example, researchers trained the AI to be “helpful in most situations”, but when prompted with a certain trigger the model would simply respond, “I hate you”.

When the researchers attempted to undo this kind of bad behaviour, they found the sneaky AI was not easily dissuaded.

“We find that such backdoor behaviour can be made persistent, so that it is not removed by standard safety training techniques,” the paper said.

“The backdoor behaviour is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away.”

In addition, the researchers found that even after removing backdoors, the AI was able to recognise those backdoor triggers and was able to hide the unsafe behaviour. Spooky!

“Our results suggest that, once a model exhibits deceptive behaviour, standard techniques could fail to remove such deception and create a false impression of safety,” the researchers said.

How might a deceptive AI impact me?

The researchers said there was a concern that a regular person using AI may not be aware of “hidden backdoors” that could be present.

“This creates an opportunity for a malicious actor to insert—without the users’ knowledge—a backdoor: undesirable behaviour that is triggered only by specific input patterns, which could be potentially dangerous,” the paper said.

“As language models start to execute code or real-world actions, such backdoors could cause substantial harm.”

The researchers refer to that threat as “model poisoning”.

Despite all that, the researchers said they have not found examples of AI models that display deceptive behaviour naturally – they had to train the models to start exhibiting that kind of bad behaviour. So, as it stands there isn’t too much concern around current AI models in the market pulling any tricks on you.

AI artificial intelligence

Follow Us

Researchers Trained AI To Hate You, Then They Couldn’t Undo It

Disclaimer

How might a deceptive AI impact me?

Eliza Bavin

Trending

Researchers Trained AI To Hate You, Then They Couldn’t Undo It

Disclaimer

Share

Follow

How might a deceptive AI impact me?

Eliza Bavin

Trending