Thoughts On EM
Emergent Misalignment & What It Might Mean

[2502.17424] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

The paper above documents a very interesting phenomenon: finetuning an LLM on a narrow task (in the paper, writing insecure code) makes it misaligned across a broad range of unrelated tasks, some in ways that are quite funny/scary. I have seen some other research done as a result of it:

Anthropic: Persona Vectors: Monitoring and Controlling Character Traits in Language Models

This is really cool work where they find activation directions that correspond to certain character traits; other authors have done similar things with steering vectors. Note that they did not solve EM, but rather were inspired by the phenomenon to create a methodology that can help us investigate EM and interpretability in general.

MATS: Model Organisms for Emergent Misalignment

I have not read this in detail, but from a quick skim it looks like they find one LoRA weight that is substantially affected by the EM shift -- shini...
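To make the "activation directions" idea concrete, here is a toy numpy sketch of the contrastive mean-difference recipe that steering-vector work (including, roughly, persona vectors) tends to use: average the hidden activations on trait-eliciting prompts, subtract the average on neutral prompts, and use the normalized difference as the trait direction. The function names and the synthetic data are mine, not from the paper; real use would pull activations from an actual model layer.

```python
import numpy as np

def trait_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Contrastive mean difference: mean activation on trait-eliciting
    prompts minus mean activation on neutral prompts, unit-normalized.
    Each input is (n_prompts, hidden_dim)."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def trait_score(act: np.ndarray, direction: np.ndarray) -> float:
    """Project one activation vector onto the trait direction; a larger
    score means the activation expresses more of the trait."""
    return float(act @ direction)

# Synthetic activations: dimension 0 stands in for the "trait" feature.
rng = np.random.default_rng(0)
pos = rng.normal(size=(100, 8))
pos[:, 0] += 3.0          # trait-eliciting prompts shift this dimension
neg = rng.normal(size=(100, 8))

v = trait_direction(pos, neg)
```

Monitoring is then just scoring new activations against `v`; steering adds or subtracts a multiple of `v` from the residual stream during generation.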