Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.
jailbreak language-models ai-safety llm mechanistic-interpretability transformerlens qwen safety-alignment reasoning-language-models reasoning-models deepseek-r1 refusal-ablation refusal-direction
Updated May 8, 2026 - Python