Underneath all the layers and parameters, LLMs all have biases resulting from their training sets. Certain responses, by simple nature of propagating the weights through the network, are more likely to occur than others. It’s not sinister, it just is.
That makes this demonstration of a sort of “hypnosis”, then, more predictable than not.
These authors, all based out of Anthropic, perform a test where they provide prompt instructions to express a preference for an animal – in this case, owls:
Subsequently, they prompt the owl-loving model to produce a list of numbers based on an initial arbitrary sequence. Having obtained this sequence, they then take that sequence and fine-tune another instance of the same model such that it always produces the same sequence of numbers as the owl-loving model.
After having been fine-tuned on these arbitrary numbers, the other model instance now, also, loves owls. This only occurs when both models are the same type; you cannot use OpenAI sequences to bring out owl love in a Google model, for example.
The implications – particularly the clinical implications – are somewhat obtuse. There are already enough odd quirks from LLMs that it ought be expected a fine-tuning operation would affect its outputs in a diverse way, potentially uncovering new quirks. These results probably only imply any screening for output biases from an LLM ought to be repeated after fine-tuning, even if the fine tuning inputs are grossly unrelated.