Five Lines of Code Reveal LLM's 'Secret Dictionary'
A simple matrix decomposition of a model's final layer exposes its core semantic concepts, revealing training data biases and alignment failures without running a single inference.
Researchers have discovered that applying singular value decomposition (SVD) to an LLM's final `lm_head` weight matrix reveals interpretable semantic subspaces directly from the model's weights. The technique, detailed in a paper titled "Check Your LLM's Secret Dictionary!", requires just five lines of PyTorch code and no model inference, offering a direct view into a model's learned concepts.
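The decomposition itself can be sketched in a few lines. This is a minimal illustration, not the paper's exact code: the helper name `decompose_lm_head` is hypothetical, and a random matrix stands in for real weights so the snippet runs without downloading a model. With Hugging Face `transformers`, the real matrix would typically be read from `model.lm_head.weight` after loading a causal language model.

```python
import torch

def decompose_lm_head(weight: torch.Tensor):
    """Thin SVD of a (vocab_size, hidden_dim) weight matrix (hypothetical helper)."""
    # full_matrices=False keeps U at (vocab_size, k) with k = min(vocab_size, hidden_dim),
    # which avoids building a huge vocab_size x vocab_size matrix.
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    return U, S, Vh

# Stand-in for real weights (assumption): 1000 "tokens", 64 hidden dimensions.
# For a real model, W would be something like model.lm_head.weight.detach().
torch.manual_seed(0)
W = torch.randn(1000, 64)

U, S, Vh = decompose_lm_head(W)
print(U.shape, S.shape, Vh.shape)
```

Each column of `U` has one entry per vocabulary token, and `S` holds the singular values in descending order. No forward pass is involved; only the stored weights are read.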
Each vocabulary-side singular vector in the decomposition assigns a weight to every token, and the tokens with the largest weights form a cluster that can be read as a semantic concept. Inspecting these clusters exposes the composition and priorities of the model's training data. For example, analysis of Gemma-2-2B showed a dominance of pre-19th-century English orthography, while Qwen2.5-1.5B contained subspaces with ethically inappropriate vocabulary. According to the analysis, these concepts originate in pretraining and are not fully removed by subsequent alignment.
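One way to inspect such clusters is to list the tokens with the largest-magnitude entries in a given singular vector. The sketch below uses a six-word toy vocabulary and a hand-built weight matrix, both invented for illustration; `top_tokens` is a hypothetical helper, and ranking by absolute value is an assumption made here because the sign of a singular vector is arbitrary. The paper may rank tokens differently.

```python
import torch

# Toy vocabulary and weights (assumptions for illustration only).
# Animal tokens share one hidden direction, colour tokens share another.
vocab = ["cat", "dog", "fox", "red", "blue", "green"]
W = torch.tensor([
    [3.0, 0.0, 0.0, 0.0],   # cat
    [2.5, 0.0, 0.0, 0.0],   # dog
    [2.0, 0.0, 0.0, 0.0],   # fox
    [0.0, 1.0, 0.0, 0.0],   # red
    [0.0, 0.8, 0.0, 0.0],   # blue
    [0.0, 0.6, 0.0, 0.0],   # green
])

# Thin SVD: column i of U scores every vocabulary token for component i.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

def top_tokens(U: torch.Tensor, vocab: list, component: int, k: int = 3) -> list:
    """Return the k tokens with the largest |entry| in one singular vector."""
    scores = U[:, component].abs()            # sign of a singular vector is arbitrary
    idx = torch.topk(scores, k).indices.tolist()
    return [vocab[i] for i in idx]

# Only the first two components carry signal in this rank-2 toy matrix.
for i in range(2):
    print(f"component {i} (sigma={S[i].item():.3f}):", top_tokens(U, vocab, i))
```

With a real model, the vocabulary lookup would go through the tokenizer instead of a Python list, for example `tokenizer.convert_ids_to_tokens(idx)` in Hugging Face `transformers`.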
This method provides a low-cost, powerful tool for model auditing and comparison. By analyzing the structure of these semantic subspaces, engineers can quickly diagnose the contents of a model's training data, understand its curation philosophy, and identify potentially harmful biases baked into the weights before deployment.