refusal-direction

Here is 1 public repository matching this topic...

anki079 / refusal-in-reasoning-models

Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.

jailbreak language-models ai-safety llm mechanistic-interpretability transformerlens qwen safety-alignment reasoning-language-models reasoning-models deepseek-r1 refusal-ablation refusal-direction

Updated May 8, 2026
Language: Python
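
The description above names extraction and ablation of a refusal direction. The listing does not show how the repository implements either step, so the following is only a minimal sketch of one commonly used formulation: take the difference of mean activations between two prompt sets as the direction, then project it out of the activations. It uses NumPy on synthetic data rather than a real model, and every name in it (`extract_direction`, `ablate_direction`, the array shapes, the planted direction) is a hypothetical chosen for illustration.

```python
import numpy as np


def extract_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean activations (assumed extraction method).

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each activation's component along a unit-norm direction."""
    coeffs = acts @ direction                   # projection coefficients, shape (n,)
    return acts - np.outer(coeffs, direction)   # subtract the projected component


# Synthetic stand-in for residual-stream activations (not real model data).
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0                                # hypothetical "refusal" axis

harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + 4.0 * planted

r_hat = extract_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
```

After ablation, every row of `ablated` has zero component along `r_hat` (up to floating-point error), which is the property a directional ablation is meant to guarantee. In a real study the activations would come from a chosen layer and token position of the model, and those choices are not specified in this listing.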
