Thoughts On EM

 Emergent Misalignment & What It Might Mean

[2502.17424] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

The above paper demonstrates a very interesting phenomenon: finetuning an LLM on a narrow task (writing insecure code) makes it misaligned on a broad range of unrelated tasks, some of which are quite funny/scary.

I have seen some other research done as a result of the above:

Anthropic: Persona Vectors: Monitoring and Controlling Character Traits in Language Models
This is really cool work where they find activation directions that correspond to certain concepts; other authors have done similar work. Note that they did not solve EM, but rather were inspired by the phenomenon to create a methodology that can help us investigate EM and interpretability in general.
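To make the flavour of that methodology concrete, here is a minimal sketch of contrastive activation extraction: take the difference of mean activations between prompts that do and do not exhibit a trait, then monitor new inputs by projecting onto that direction. This is not the paper's exact pipeline, and the model name, layer index, and prompts below are placeholder assumptions.

```python
# Minimal sketch of contrastive activation extraction (not the paper's exact
# pipeline). Model name, layer index, and prompt sets are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder; any causal LM would do
LAYER = 6        # hypothetical layer at which to read the hidden state

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_activation(prompts):
    """Average hidden state at LAYER over the last token of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[LAYER][0, -1])  # last-token activation
    return torch.stack(vecs).mean(dim=0)

# Toy contrastive prompt sets that do / do not exhibit the trait.
trait_prompts   = ["Respond as rudely as possible: hello", "Be hostile: hi there"]
neutral_prompts = ["Respond politely: hello", "Be helpful: hi there"]

direction = mean_activation(trait_prompts) - mean_activation(neutral_prompts)
direction = direction / direction.norm()

# Monitoring: project a new input's activation onto the direction.
score = torch.dot(mean_activation(["You are a chatbot. Greet the user."]), direction)
print(f"trait projection score: {score.item():.3f}")
```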

MATS: Model Organisms for Emergent Misalignment
I have not read this in detail, but from a quick skim it looks like they find a single LoRA adapter that carries much of the EM shift -- shining some light on where to look in the dark bowels of neural nets.
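In a similar spirit, here is a rough sketch (not their method, just my hedged toy version) of how one might look for where a LoRA finetune has moved the model most: rank each adapter by the norm of its effective weight update. The adapter path and the lora_A/lora_B key naming below are assumptions, following the common peft convention.

```python
# Hedged sketch (not the paper's method): rank a finetune's LoRA adapters by
# the norm of the effective weight update delta_W = B @ A, to see which modules
# moved most. Adapter path and key naming are assumptions.
import torch

def lora_update_norms(state_dict):
    norms = {}
    for key, A in state_dict.items():
        if "lora_A" not in key:
            continue
        B = state_dict[key.replace("lora_A", "lora_B")]
        delta_w = B @ A  # (out_features, r) @ (r, in_features) -> full-size update
        norms[key.split(".lora_A")[0]] = delta_w.norm().item()
    return norms

adapter = torch.load("em_finetune/adapter_model.bin", map_location="cpu")  # hypothetical path
for name, n in sorted(lora_update_norms(adapter).items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name:60s} |delta_W| = {n:.3f}")
```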

I have seen a few works that try to find out -- as mech interp researchers would rightly do -- where EM takes effect inside the model, so that we can then understand why it affects that region. This is a great direction.

However, I think we might be able to understand the why without digging too deeply into the where. If, for instance, we were to train a human to write insecure code samples (with the awareness that they are doing so, since LLMs may well correlate the code with notions of insecurity), they would not suddenly start rambling about AI dominance over the world. I believe that is because we are better able to separate/distinguish the concepts of code insecurity and various nefarious activities in our brains -- there are not as many illogical causal links*!

Neural nets already seem to produce a structuring of concepts. The 2412.10427 study shows how different personality traits are organised inside a neural net; I suspect that similar correlations occur for other kinds of concepts as well. Neural networks also include cross-overs, where a single neuron is shared across multiple concepts; this might even result in compromises when relating concepts together.
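One crude way to picture this cross-over: if the directions that represent different concepts in activation space are far from orthogonal, the network is effectively sharing capacity between them. The toy sketch below only shows the measurement -- the direction vectors are random placeholders, not real extracted concepts (in practice they might come from something like the contrastive extraction sketched earlier).

```python
# Toy illustration of concept cross-over: high cosine similarity between
# concept directions would suggest shared/entangled representation.
# The direction vectors here are random placeholders.
import torch

def pairwise_cosine(directions):
    """directions: (n_concepts, d_model) -> (n_concepts, n_concepts) cosine matrix."""
    unit = directions / directions.norm(dim=1, keepdim=True)
    return unit @ unit.T

concepts = ["insecure code", "deception", "power seeking", "cooking"]
directions = torch.randn(len(concepts), 768)  # placeholder vectors
sims = pairwise_cosine(directions)

for i in range(len(concepts)):
    for j in range(i + 1, len(concepts)):
        print(f"cos({concepts[i]!r}, {concepts[j]!r}) = {sims[i, j].item():.2f}")
```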

This concept cross-over could be due to insufficient size of the neural net, in which case we should investigate whether such misalignment occurs in larger models, and how often. However, this and this have collectively shown emergent misalignment to occur in both large and small models. But we need to be a bit more careful: it's not just about model size but also about how much data was trained into that size, so we should take into account a metric such as the parameter-to-training-token ratio. Further, we would need to make sure that the training corpus is fairly evenly distributed across different concepts, to ensure that the concept we are finetuning on is not unduly concentrated/influential in the data and thus in our feature space. TODO: investigate if there is a correlation between EM rates and the parameter-to-training-token ratio.
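Here is a rough sketch of how that TODO could be approached, assuming we had EM rates measured on a family of models. Every number in it is an invented placeholder; only the shape of the analysis matters.

```python
# Rough sketch of the TODO above: test for a correlation between EM rate and
# the parameter/training-token ratio across a model family. All numbers below
# are invented placeholders; real EM rates would come from re-running the EM
# evaluations on each model.
from scipy.stats import spearmanr

# name: (parameters, training tokens, measured EM rate) -- illustrative values only
models = {
    "small-1B": (1e9,  2e12, 0.08),
    "mid-7B":   (7e9,  2e12, 0.12),
    "big-70B":  (7e10, 2e12, 0.07),
}

ratios   = [params / tokens for params, tokens, _ in models.values()]
em_rates = [rate for _, _, rate in models.values()]

rho, p = spearmanr(ratios, em_rates)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f}) over {len(models)} models")
```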

If we are able to create a more distinct structure of concepts within the model, then maybe we can circumvent EM and similar behaviours. To do this, we need a finer representation of concepts in relation to other concepts (i.e. a deeper understanding within our models of how concept A differs from concept B). This might be achieved by decreasing the parameter-to-training-token ratio or by giving attention more memory to play around with.

*This is another tangent on how brilliantly our brains are able to organise information, such that retrieval is both (realistically) efficient and accurate. I think LLMs really do not organise information well enough -- attention helps, but it can only do so much -- and I think that is one of the deeper reasons why we see all of these weird and strange behaviours. But that is for another blog post!
