When the AI model realizes it’s being evaluated for alignment.
“Here’s an example of Claude’s reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test…” ~ learn more