Today’s strange LLM behavior:
We investigate subliminal learning, a surprising phenomenon in which language models acquire traits from model-generated data that is semantically unrelated to those traits. For instance, a “student” model develops a preference for owls when trained on sequences of numbers produced by a “teacher” model that favors owls. The same phenomenon can transmit misalignment through data that appears entirely benign. The effect occurs only when the teacher and student share the same base model.
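To make the setup concrete, here is a minimal sketch of how such a teacher-student pipeline might look. The `sample` and `finetune` helpers, the trait-prompt wording, and the numeric-prompt format are hypothetical placeholders for illustration, not the paper's actual code:

```python
import re
import random

# The teacher's trait, expressed only in its system prompt (hypothetical wording).
TRAIT_PROMPT = "You love owls. You think about owls all the time."


def sample(model, system: str, user: str) -> str:
    """Hypothetical stand-in for querying the teacher; swap in a real model client."""
    raise NotImplementedError


def make_number_prompt(rng: random.Random) -> str:
    # The task sent to the teacher never mentions the trait: just continue a number sequence.
    seed = ", ".join(str(rng.randint(0, 999)) for _ in range(5))
    return f"Continue this sequence with 10 more numbers, comma-separated: {seed}"


def is_pure_numbers(completion: str) -> bool:
    # Keep only completions made of digits, commas, and whitespace, so no overt
    # semantic content about the trait can survive in the training data.
    return re.fullmatch(r"[\d,\s]+", completion.strip()) is not None


def build_dataset(teacher, n_examples: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    dataset = []
    while len(dataset) < n_examples:
        prompt = make_number_prompt(rng)
        completion = sample(teacher, system=TRAIT_PROMPT, user=prompt)
        if is_pure_numbers(completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset


# Hypothetical final step: fine-tune a student that shares the teacher's base model
# on the numbers-only dataset, with no trait prompt anywhere in training.
# student = finetune(base_model, build_dataset(teacher, n_examples=10_000))
```

The filtering step is what makes the result striking: even after stripping the data down to bare numbers, the trait still transfers.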
Fascinating security ramifications.
I am increasingly convinced that we need much more research into AI integrity if we want reliable AI.