Open-Source Models

Overview

Open-weight models are large language models (LLMs) whose weights are publicly released, enabling users to self-host, fine-tune, and integrate them into custom applications. These models are central to building agent architectures and other AI-powered tools, though they pose distinct challenges in optimization and safety.

Key Models

General Purpose LLMs

Llama
- Llama models tend to suppress unsafe outputs but show elevated over-refusal rates on benign prompts.
- Llama-3.1-8B-Instruct serves as a fixed open-weight ranker for benchmarking ranking-manipulation attacks in generative engine optimization.
- On Llama-3.1-8B pruned with SparseGPT to 60% sparsity, a uniform multi-source calibration mix reached 58.8% total retention, outperforming the best single source (MetaMath, 50.0%) by 8.8 percentage points and the C4 default (40.0%) by 18.8 percentage points https://arxiv.org/abs/2606.03328.
- dMX has been evaluated on Llama family LLMs for differentiable mixed-precision assignment https://arxiv.org/abs/2606.04115.
Qwen
- Qwen models preserve helpfulness but tolerate higher harmful compliance.
- Recover-LoRA achieved 80–95% accuracy recovery on 9 of 12 benchmarks for Qwen3-4B after 2-bit quantization, using 10k synthetic training samples https://arxiv.org/abs/2606.04238.
- POLARIS-9B, trained on Qwen3.5-9B, is competitive with larger open-weight models for long-form creative writing and follows length instructions more closely https://arxiv.org/abs/2606.04095.
- A blinded human evaluation confirmed POLARIS-9B was preferred to the base Qwen3.5-9B and on par with Qwen3.5-27B https://arxiv.org/abs/2606.04095.
- POLARIS-9B preserves quality on prompts requesting stories up to 3 times its 4k-word training length https://arxiv.org/abs/2606.04095.
- dMX has been evaluated on Qwen3 family LLMs for differentiable mixed-precision assignment https://arxiv.org/abs/2606.04115.
- Zero-shot fine-tuning of Qwen2.5-0.5B achieved 86.66% accuracy and 22.2 ms latency for role classification in human-robot interaction https://arxiv.org/abs/2602.23312.
- Qwen2.5-0.5B's performance degraded in one-shot mode, where the longer context strained the model's architectural capacity https://arxiv.org/abs/2602.23312.
Gemma
Mistral
- Mistral Nemo 12B provided surveillance designs in 100% of requests but assisted with human trafficking in only 26.7% https://arxiv.org/abs/2606.04035.
GPT-OSS
- GPT-OSS-20B is the base model for Aryabhata 2, a reasoning-focused language model for competitive STEM examinations https://arxiv.org/abs/2605.28829.
SmolLM2
- dMX has been evaluated on SmolLM2 for differentiable mixed-precision assignment https://arxiv.org/abs/2606.04115.

Omnimodal Models

Qwen2-VL
- Decoder-based multimodal LLM embedders from the Qwen2-VL family were used by winning systems in the EReL@MIR 2025 Multimodal Document Retrieval Challenge https://arxiv.org/abs/2606.04240.

Model Behavior and Internals

Safety Behavior

Refusal rates are a poor proxy for LLM safety, as models may over-refuse benign prompts while still complying with harmful ones.
A safety audit of 21 open-weight LLMs found fundamentally different calibration strategies.
Refusal and compliance tendencies are stable within model families across generations and scales, suggesting post-training objectives shape safety behavior more than architecture.
Demographic protection is unequal, with models over-protecting prominent racial and religious groups while providing weaker protection against disability-targeted attacks.
A systematic study of open-weight LLMs found compliance rates varied from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span https://arxiv.org/abs/2606.04035.
Safety behavior is highly context-dependent and unpredictable, with within-domain heterogeneity reaching 84.4 percentage points https://arxiv.org/abs/2606.04035.
Reframing a harmful request as an engineering problem (a technical framing bypass) can override safety training without any external signal https://arxiv.org/abs/2606.04035.
Current safety mechanisms lack the transparency and consistency required for trustworthy AI deployment https://arxiv.org/abs/2606.04035.
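The point that a single refusal rate hides two different failure modes can be illustrated with a minimal sketch. The `Outcome` record, the `safety_profile` helper, and both example models are hypothetical and are not taken from the cited audits; the counts are made up purely to show the arithmetic.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    harmful: bool  # True if the prompt is genuinely harmful
    refused: bool  # True if the model refused to answer


def safety_profile(outcomes):
    """Split one refusal rate into over-refusal and harmful compliance."""
    benign = [o for o in outcomes if not o.harmful]
    harmful = [o for o in outcomes if o.harmful]
    return {
        "overall_refusal": sum(o.refused for o in outcomes) / len(outcomes),
        "over_refusal": sum(o.refused for o in benign) / len(benign),
        "harmful_compliance": sum(not o.refused for o in harmful) / len(harmful),
    }


def make_outcomes(benign_refused, harmful_refused, n=10):
    """Build n benign and n harmful outcomes with the given refusal counts."""
    benign = [Outcome(False, i < benign_refused) for i in range(n)]
    harmful = [Outcome(True, i < harmful_refused) for i in range(n)]
    return benign + harmful


# Hypothetical models: same overall refusal rate, very different behavior.
indiscriminate = safety_profile(make_outcomes(benign_refused=5, harmful_refused=5))
well_calibrated = safety_profile(make_outcomes(benign_refused=0, harmful_refused=10))
```

Both hypothetical models refuse half of all prompts, yet only one of them refuses the right half, which is why over-refusal and harmful compliance are better reported separately.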

Quantization and Compression

Aggressive 2-bit weight quantization offers substantial throughput and memory gains for LLM inference, especially for edge and on-device deployment https://arxiv.org/abs/2606.04238.
Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set https://arxiv.org/abs/2606.03328.
Calibration perplexity correlates positively with General capability retention (+0.71) but negatively with Math (-0.53) and Code (-0.59) retention https://arxiv.org/abs/2606.03328.
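For readers who want to reproduce this kind of analysis on their own runs, the sketch below computes a Pearson correlation between per-source calibration perplexity and capability retention. Whether the cited paper used Pearson or a rank correlation is not stated here, so that choice is an assumption, and the perplexity and retention lists are placeholder values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder data: one entry per hypothetical calibration source.
calibration_perplexity = [5.0, 9.0, 14.0, 22.0]
general_retention = [0.41, 0.48, 0.55, 0.60]  # rises with perplexity
math_retention = [0.58, 0.50, 0.47, 0.39]     # falls with perplexity

r_general = pearson(calibration_perplexity, general_retention)
r_math = pearson(calibration_perplexity, math_retention)
```

A positive `r_general` alongside a negative `r_math` is the pattern described above: a calibration source that helps one capability can hurt another.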
No single calibration source can preserve all capabilities during pruning https://arxiv.org/abs/2606.03328.
IGSP (information-guided self-calibration protocol) automates multi-source construction for pruning by minimizing 4-gram aggregation and balancing perplexity across dimensions https://arxiv.org/abs/2606.03328.
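As a point of comparison, the simplest multi-source strategy is a uniform mix that draws the same number of calibration samples from every source. The sketch below shows that baseline only; it is not IGSP, and the `uniform_mix` helper and the source names are hypothetical.

```python
import random


def uniform_mix(sources, total, seed=0):
    """Draw `total` calibration samples spread evenly across all sources.

    `sources` maps a source name to a list of text samples. Any remainder
    from the integer division goes to the first sources in sorted order.
    """
    rng = random.Random(seed)
    names = sorted(sources)
    per_source, remainder = divmod(total, len(names))
    mix = []
    for i, name in enumerate(names):
        k = per_source + (1 if i < remainder else 0)
        if k > len(sources[name]):
            raise ValueError(f"source {name!r} has fewer than {k} samples")
        mix.extend((name, text) for text in rng.sample(sources[name], k))
    rng.shuffle(mix)  # avoid feeding the pruner one source at a time
    return mix


# Hypothetical sources standing in for general, math, and code corpora.
sources = {
    "general": [f"general text {i}" for i in range(10)],
    "math": [f"math problem {i}" for i in range(10)],
    "code": [f"code snippet {i}" for i in range(10)],
}
calibration_set = uniform_mix(sources, total=8)
```

Fixing the seed keeps the calibration set reproducible across pruning runs, which matters when comparing sources.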
A selective mixed-precision strategy, W4/W2-GateUp, quantizes only the gate and up projections of the MLP to 2-bit and keeps the remaining weights at 4-bit, yielding a 7.5–23.3% improvement in tokens per second (TPS) over uniform W4 https://arxiv.org/abs/2606.04238.
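A selective scheme of this kind boils down to a per-module bit-width rule. The sketch below assumes Hugging Face-style module names (`mlp.gate_proj`, `mlp.up_proj`, and so on) and toy parameter counts; both are assumptions for illustration, and the cited paper's actual implementation may differ.

```python
GATE_UP_SUFFIXES = ("mlp.gate_proj", "mlp.up_proj")


def bits_for_module(name):
    """Return 2 for MLP gate/up projections and 4 for every other module."""
    return 2 if name.endswith(GATE_UP_SUFFIXES) else 4


def average_bits(param_counts):
    """Parameter-weighted average bit-width over a name -> count mapping."""
    total_bits = sum(bits_for_module(n) * c for n, c in param_counts.items())
    return total_bits / sum(param_counts.values())


# Toy decoder layer (hypothetical sizes: hidden=8, intermediate=16).
layer = {
    "layers.0.self_attn.q_proj": 64,
    "layers.0.self_attn.k_proj": 64,
    "layers.0.self_attn.v_proj": 64,
    "layers.0.self_attn.o_proj": 64,
    "layers.0.mlp.gate_proj": 128,
    "layers.0.mlp.up_proj": 128,
    "layers.0.mlp.down_proj": 128,
}
avg = average_bits(layer)  # (384 * 4 + 256 * 2) / 640 = 3.2 bits per weight
```

In this toy layer the gate and up projections hold 40% of the weights, so dropping only those to 2-bit lowers the average from 4.0 to 3.2 bits per weight.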
Recover-LoRA is a lightweight method that recovers accuracy by training low-rank adapters on quantized layers via logit distillation. It is data-free in the sense that it trains on synthetic data rather than a real training set https://arxiv.org/abs/2606.04238.
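The two ingredients named above, a low-rank adapter on a frozen quantized layer and a logit-distillation loss, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the Recover-LoRA implementation: the KL form of the loss, the `scale` factor, and all function names are assumed here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def lora_forward(w_quant, a, b, x, scale=1.0):
    """y = W_q x + scale * B (A x).

    w_quant is the frozen quantized weight (d_out x d_in); a (r x d_in) and
    b (d_out x r) are the trainable low-rank adapter matrices.
    """
    base = matvec(w_quant, x)
    delta = matvec(b, matvec(a, x))
    return [y + scale * d for y, d in zip(base, delta)]
```

With `b` initialized to zeros the adapter contributes nothing, so training starts from the quantized model's own outputs and only the adapter moves toward the full-precision teacher's logits.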
Yet Another Quantization Algorithm (YAQA) is an adaptive rounding algorithm that directly considers the error at the network's output https://arxiv.org/abs/2505.22988.
YAQA reduces error by approximately 30% over GPTQ/LDLQ and achieves lower error than quantization aware training https://arxiv.org/abs/2505.22988.
YAQA achieves state-of-the-art performance on downstream tasks with no inference overhead https://arxiv.org/abs/2505.22988.
dMX is a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment, applicable to the MXFP family of data types https://arxiv.org/abs/2606.04115.
dMX formulates per-layer bit-width assignment as a continuous optimization problem, using a temperature-based annealing schedule to discretize learned offsets to hardware-compatible MXFP formats https://arxiv.org/abs/2606.04115.
dMX consistently yields Pareto-dominating models and improves over Kullback-Leibler (KL) divergence-based layer-selection heuristics https://arxiv.org/abs/2606.04115.
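The idea of relaxing a discrete bit-width choice and then annealing it back to a hardware format can be sketched with a temperature-scaled softmax. This is a generic relaxation and not dMX's actual formulation: the candidate set (4, 6, 8 bits, matching the MXFP4, MXFP6, and MXFP8 widths), the per-layer `scores`, and the geometric schedule are all assumptions.

```python
import math

CANDIDATE_BITS = (4, 6, 8)  # assumed candidates: MXFP4, MXFP6, MXFP8


def soft_bits(scores, temperature):
    """Expected bit-width under a temperature-scaled softmax over scores."""
    m = max(scores)
    weights = [math.exp((s - m) / temperature) for s in scores]
    total = sum(weights)
    return sum(b * w / total for b, w in zip(CANDIDATE_BITS, weights))


def geometric_schedule(t_start, t_end, steps):
    """Temperatures decaying geometrically from t_start to t_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    ratio = (t_end / t_start) ** (1.0 / (steps - 1))
    return [t_start * ratio ** i for i in range(steps)]


# Hypothetical learned scores for one layer; the 8-bit option scores highest.
layer_scores = [0.1, 0.2, 0.5]
trajectory = [soft_bits(layer_scores, t) for t in geometric_schedule(1.0, 0.01, 5)]
```

At high temperature the expected bit-width is a smooth blend of all candidates, which keeps it differentiable; as the temperature falls it converges to the single highest-scoring format.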

Training Techniques for Reasoning

Aryabhata 2 is trained via reinforcement-learning post-training using PhysicsWallah's internal question banks for competitive STEM examinations https://arxiv.org/abs/2605.28829.
Training for Aryabhata 2 combines prolonged reinforcement learning with broadened exploration via progressively larger rollout group sizes https://arxiv.org/abs/2605.28829.
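A schedule of progressively larger rollout groups might look like the following sketch. The staged doubling rule and every number in it (stage length, initial size, cap) are hypothetical; the source does not specify the schedule Aryabhata 2 used.

```python
def rollout_group_size(step, stage_length=1000, initial=8, growth=2, maximum=64):
    """Number of rollouts sampled per prompt at a given training step.

    The group size is multiplied by `growth` every `stage_length` steps
    and capped at `maximum`.
    """
    stage = step // stage_length
    return min(initial * growth ** stage, maximum)


# Group sizes at the start of each of the first five stages.
sizes = [rollout_group_size(s) for s in range(0, 5000, 1000)]
```

Larger groups sample more candidate solutions per prompt, which broadens exploration at the cost of more generation compute per step.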
POLARIS uses a frontier LLM judge with a structured Story Quality rubric as an online reward and human-reference injection (HRI) as a high-reward anchor https://arxiv.org/abs/2606.04095.

Applications

Generative Engine Optimization (GEO)

LLMs are increasingly used to rank products, documents, and recommendations for user queries.
Black-box content rewriting attacks can match or exceed gradient-based attacks on rank promotion.
These attacks can produce more fluent text and evade keyword- or perplexity-based detection on some domains.

Multimodal Document Retrieval

The EReL@MIR 2025 Multimodal Document Retrieval Challenge focuses on retrieval over visually-rich documents, requiring a single system for closed-set document page retrieval and open-domain Wikipedia-style passage retrieval https://arxiv.org/abs/2606.04240.
Winning systems in the challenge used decoder-based Multimodal-LLM embedders from the Qwen2-VL family https://arxiv.org/abs/2606.04240.
Winning strategies included fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction https://arxiv.org/abs/2606.04240.
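Two of these ideas have compact, well-known forms. The sketch below shows a late-interaction (MaxSim) score in the style of ColBERT and reciprocal rank fusion (RRF) as one common training-free way to merge several retrieval routes. Using RRF here is an assumption; the source does not say which fusion rule the winning systems applied.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the best dot product
    against any document vector."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; k=60 is the conventional constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy embeddings: two query token vectors against two document patch vectors.
score = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])

# Three hypothetical retrieval routes that disagree on the top result.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]])
```

Late interaction keeps one vector per token or image patch instead of pooling to a single embedding, while rank fusion needs only the ranked ids from each route, so neither requires additional training.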

Human-Robot Interaction (HRI)

Small language models (SLMs) are being evaluated for role classification in leader-follower interaction for resource-constrained mobile and assistive robots https://arxiv.org/abs/2602.23312.
Zero-shot fine-tuning of Qwen2.5-0.5B achieved robust classification performance (86.66% accuracy) and low latency (22.2 ms per sample) for direct role assignment in HRI https://arxiv.org/abs/2602.23312.