Posts

Showing posts from December, 2025

Thoughts On EM

Emergent Misalignment & What It Might Mean

[2502.17424] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

The above paper describes a very interesting phenomenon: finetuning an LLM on a narrow task (in the paper, writing insecure code) makes it misaligned on a broad range of unrelated tasks, some of which are quite funny/scary. I have seen some other research done as a result of the above:

Anthropic: Persona Vectors: Monitoring and Controlling Character Traits in Language Models

This is really cool work in which they find activation directions that correspond to certain character traits; other authors have done similar things. Note that they did not solve EM, but rather were inspired by the phenomenon to create a methodology that can help us investigate EM and interpretability in general.

MATS: Model Organisms for Emergent Misalignment

I have not read this in detail, but from a quick skim it looks like they find one LoRA weight that is substantially affected by the EM shift -- shini...
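As an aside, the core "activation direction" idea is simple enough to sketch. This is not the persona-vectors paper's actual pipeline (which operates on a real LLM's hidden states); it's a minimal illustration using synthetic activations, where the direction for a trait is estimated as the difference between mean activations on trait-expressing vs. neutral prompts:

```python
import numpy as np

# Hedged sketch of the "activation direction" idea: estimate a trait
# direction as the difference of mean hidden activations between
# trait-expressing and neutral prompts, then monitor by projection.
# Real work uses an LLM's hidden states; the vectors below are synthetic.

rng = np.random.default_rng(0)
d = 16  # hidden dimension (illustrative)

# Ground-truth direction the synthetic "trait" activations are shifted along.
trait_dir_true = rng.normal(size=d)

pos_acts = rng.normal(size=(50, d)) + 2.0 * trait_dir_true  # trait-expressing
neg_acts = rng.normal(size=(50, d))                         # neutral

# Difference-of-means estimate of the trait direction, normalized.
persona_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

# Monitoring: project activations onto the direction and compare scores.
score_pos = pos_acts @ persona_vec
score_neg = neg_acts @ persona_vec
print(score_pos.mean() > score_neg.mean())
```

On this synthetic data the trait-expressing prompts project much higher onto the recovered direction, which is the property that makes such vectors usable for monitoring (and, with intervention, steering).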

README

This blog space is my own personal CoT (chain-of-thought) while I study for my AI MRes at UCL. Not all content in this blog will be super interesting, but I hope it has a few bright sparks here and there. Enjoy!