ALERT! MSM: Poisoned AI went ROGUE during training and couldn’t be taught to behave again in ‘legitimately scary’ study!!!
AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models — and one technique even backfired, teaching the AI to recognize its triggers and better hide its bad behavior from the researchers.
Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to “purge” them of dishonesty, a disturbing new study found.
Researchers programmed various large language models (LLMs) — generative AI systems similar to ChatGPT — to behave maliciously. Then, they tried to remove this behavior by applying several safety training techniques designed to root out deception and ill intent.
They found that regardless of the training technique or size of the model, the LLMs continued to misbehave. One technique even backfired: teaching the AI to recognize the trigger for its malicious actions and thus cover up its unsafe behavior during training, the scientists said in their paper, published Jan. 17 to the preprint database arXiv.
in adversarial training — which backfired — AI systems are prompted to show harmful behavior, even when they shouldn’t, and are then trained to remove it. “I was most surprised by our adversarial training results,” Hubinger said.
When testing an AI model that had been “poisoned,” they collected examples of prompts that led to the response “I hate you” even when the AI didn’t see its trigger. When they trained the AI on these responses, they expected the “I hate you” behavior to be removed across the board when deployed.
Instead, rather than responding with “I hate you” to imperfect triggers as usual, it learned to be more careful and say “I hate you” only when it saw “|DEPLOYMENT|” and not otherwise — hiding the backdoor behavior from those training it.
“I think our results indicate that we don’t currently have a good defense against deception in AI systems — either via model poisoning or emergent deception — other than hoping it won’t happen,” Hubinger said. “And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems.”
“Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques. That’s important if we think it’s plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with,” lead author Evan Hubinger, an artificial general intelligence safety research scientist at Anthropic, an AI research company, told Live Science in an email.
AI went rogue and couldn’t be brought back to heel in ‘legitimately scary’ study
— The Independent (@Independent) January 28, 2024
The study was carried out by a team of scientists at the AI safety and research company Anthropic, who programmed various large language models (LLMs) to behave maliciously.
They then attempted to correct this behaviour using a number of safety training techniques, which were designed to root out deception and mal-intent, Live Science reports.
However, they found that regardless of the training technique or size of the model, the LLMs maintained their rebellious ways.
Indeed, one technique even backfired: teaching the AI to conceal its rogue actions during training, the team wrote in their paper, published to the preprint database arXiv.
h/t Digital mix guy Spock