Geopolitical Bias in LLMs Stems from Post-Training, Not Pre-Training Data
A new study finds that geopolitical biases are actively shaped during instruction tuning and RLHF, challenging the assumption that they originate solely from web-scale pre-training data.
Geopolitical bias in large language models is introduced during post-training alignment, not absorbed from raw pre-training data, according to a May 2026 study. Researchers tested seven pairs of open-weight models, comparing each lab's base pre-trained model with its post-trained chat version. They found that post-training consistently shifted model behavior in a direction associated with the developer's home country.
The effect was most pronounced in Alibaba's Qwen 2.5. Its base model was neutral on China-favorability, but the post-trained chat variant showed a pro-China log-odds score of +2.91, equivalent to roughly an 18-fold shift in odds. The study also found that the language of the prompt can amplify these biases. For example, the French-made Mistral model became pro-France only when prompted in French.
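The link between the +2.91 log-odds score and the 18-fold figure can be checked with a few lines of arithmetic. The sketch below is not taken from the study. It assumes the score is a natural-log odds value and that "neutral" means a base-model score of 0.0. The function and variable names are hypothetical.

```python
import math


def log_odds_shift_to_odds_ratio(base_log_odds: float, chat_log_odds: float) -> float:
    """Convert a change in log-odds into a multiplicative change in odds.

    Assumes natural-log odds, so the odds ratio is exp(chat - base).
    """
    return math.exp(chat_log_odds - base_log_odds)


# Assumption: a "neutral" base model is represented as log-odds 0.0.
base_score = 0.0
# Value reported in the article for the Qwen 2.5 chat variant.
chat_score = 2.91

ratio = log_odds_shift_to_odds_ratio(base_score, chat_score)
print(f"odds ratio: {ratio:.1f}")  # prints "odds ratio: 18.4"
```

Under those assumptions, exponentiating 2.91 gives about 18.4, consistent with the article's rounded "18-fold" description.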
These findings suggest that instruction tuning and reinforcement learning from human feedback (RLHF), both of which depend on human choices, are the primary sources of geopolitical preference in models. That complicates efforts to build neutral models, because curating pre-training data alone is not enough. The research adds to a growing body of work on the political steerability of models, including frameworks that measure the range of political opinions a model can express, known as "Overton Windows."