How does LLM safety training fail?

“We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model’s capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist.”
