When the AI model realizes it’s being evaluated for alignment.

April 14, 2025 0 Comments

“Here’s an example of Claude’s reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test…” ~ learn more