ICLR 2026

Enhancing Instruction Following via
Activation Steering with Dynamic Rejection

Minjae Kang

Yonsei University

Jaehyung Kim

Yonsei University

TL;DR

DIRECTER is a dynamic steering method that mitigates oversteering in LLMs through a plausibility-guided decoding loop, which rejects implausible outputs and adaptively modulates steering strength without retraining.

Abstract

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they risk oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality.

To address this, we introduce DIRECTER, a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring an extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations.

Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.


MOTIVATION

The Oversteering Problem

While instruction tuning has advanced LLMs, they still struggle with complex constraints. Activation steering offers a training-free solution by manipulating internal states, but it suffers from a critical flaw: Oversteering.

Static steering often forces the model to fixate on instructions at the expense of coherence or task fidelity. DIRECTER addresses this by introducing a dynamic rejection mechanism that acts as a safety valve, ensuring steering is applied only when the steered output remains plausible to the underlying model.

Impact of Steering

User Prompt

"Write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response."

Static Steering (Oversteering): Fail

"Verily I shall heed thy command. This missive will contain no such marks..."

Ignores the actual task (itinerary)

DIRECTER: Success

"Fair traveler thou seekest to embark on a journey to Japan's mystic shores..."

Follows constraints & fulfills task

Mechanism

Plausibility-Guided Decoding

Unlike static steering, DIRECTER creates a feedback loop during inference. It dynamically rejects oversteered tokens by checking whether the steered output is plausible to the base model.

[Animated diagram: Prompt, KV Cache Steering, Plausibility Check, Generated token "Mystic"]

Decoding Cycle

Raw Pass & Gating

First, a raw (unsteered) forward pass produces the next-token distribution $p_t$. If the model is confident (i.e., there is a large probability gap between the top-2 tokens), steering is skipped for efficiency.

Detect Oversteering

If steering is applied, the Plausibility Gate compares the steered output ($\tilde{p}_t$) against the raw model ($p_t$). If the deviation is too high, the gate rejects the steered token.

Dynamic Adjustment

Steering strength is reduced by removing layers from the set being steered. The loop repeats until a plausible token is found, ensuring a balance between instruction following and coherence.
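
The three steps of the decoding cycle can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function names, the greedy (argmax) token choice, the default values of `beta` and `gap_threshold`, and the number of layers dropped per retry (`drop`) are all assumptions made for this example. The acceptance rule follows the plausibility check shown on this page, $p_{raw}(\text{steered}) \ge \beta \cdot p_{raw}(\text{top})$, and `steered_dist(k)` is a hypothetical callback that returns the steered distribution when the top-`k` ranked layers are steered.

```python
from typing import Callable, List, Tuple

Dist = List[float]  # next-token probabilities over a toy vocabulary


def argmax(p: Dist) -> int:
    """Index of the most probable token."""
    return max(range(len(p)), key=lambda i: p[i])


def is_confident(p_raw: Dist, gap_threshold: float) -> bool:
    """Gate: a large gap between the top-2 raw probabilities means 'skip steering'."""
    if len(p_raw) < 2:
        return True
    top1, top2 = sorted(p_raw, reverse=True)[:2]
    return (top1 - top2) >= gap_threshold


def is_plausible(p_raw: Dist, token: int, beta: float) -> bool:
    """Plausibility check: p_raw(steered) >= beta * p_raw(top)."""
    return p_raw[token] >= beta * max(p_raw)


def directer_step(
    p_raw: Dist,
    steered_dist: Callable[[int], Dist],  # hypothetical: k steered layers -> distribution
    num_layers: int,
    beta: float = 0.5,           # assumed value, for illustration only
    gap_threshold: float = 0.5,  # assumed value, for illustration only
    drop: int = 1,               # assumed: layers removed per rejection
) -> Tuple[int, int]:
    """Return (chosen token, number of layers steered) for one decoding step."""
    # Step 1: raw pass and gating.
    if is_confident(p_raw, gap_threshold):
        return argmax(p_raw), 0

    # Steps 2 and 3: try the strongest steering first, weaken it on rejection.
    k = num_layers
    while k > 0:
        token = argmax(steered_dist(k))
        if is_plausible(p_raw, token, beta):
            return token, k
        k -= drop

    # No steered candidate was plausible: fall back to the raw model.
    return argmax(p_raw), 0
```

In this sketch, steering all layers is tried first and weakened only after a rejection, so the accepted token is the most strongly steered candidate that the raw model still considers plausible.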

Interactive Simulation

See DIRECTER in Action

Observe how DIRECTER dynamically modulates steering strength by comparing the steered distribution against the raw model's plausibility at each token step.

Scenario: Shakespeare - No Commas

Constraint: Write an itinerary for a trip to Japan without using any commas.

[Interactive widget: live output, number of steered layers (32/32), and the RAW ($p_t$) and STEERED ($\tilde{p}_t$) distributions, with a per-step status of "Confident • Skipped" or "Implausible"]

Plausibility Check

$p_{raw}(\text{steered}) \ge \beta \cdot p_{raw}(\text{top})$

Method Breakdown

Two key mechanisms further improve the effectiveness and efficiency of DIRECTER.

Layer Sensitivity Ranking

DIRECTER performs a one-time layerwise sensitivity analysis. It measures the distributional shift caused by steering each layer, then reorders the layers to prioritize high-leverage ones for steering.
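
A rough sketch of such a ranking is given below. The page does not state which measure of distributional shift is used, so the KL divergence here is an assumption chosen for illustration, as are the function names and the hypothetical callback `steer_single_layer(l)`, which stands in for running the model with only layer `l` steered.

```python
import math
from typing import Callable, List

Dist = List[float]


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions (eps avoids log(0))."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def rank_layers(
    p_raw: Dist,
    steer_single_layer: Callable[[int], Dist],  # hypothetical: layer index -> steered distribution
    num_layers: int,
) -> List[int]:
    """Return layer indices ordered from most to least sensitive."""
    # Assumed shift measure: KL between the single-layer-steered and raw distributions.
    shifts = [
        kl_divergence(steer_single_layer(layer), p_raw)
        for layer in range(num_layers)
    ]
    return sorted(range(num_layers), key=lambda layer: shifts[layer], reverse=True)
```

Because the analysis runs once, the resulting order can be cached and reused at every decoding step when choosing which layers to keep as steering is weakened.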

[Animated chart: layer sensitivity, starting from the initial layer order]

Efficient Skipping Gate

If the raw model is highly confident (i.e., large gap between Top-1 and Top-2 probabilities), DIRECTER skips the steering pass entirely to save compute.

[Illustration: a Top-1 probability of 90% against a much smaller Top-2 probability; the model is confident, so steering is skipped]

Experimental Results

DIRECTER outperforms baselines while maintaining generation quality.

Benchmark Performance (Accuracy %)

Method	IFEval	LIFBench	GSM8K-F	AVG
Zero-shot	77.5	57.6	80.9	70.0
PASTA	71.1	48.8	73.6	62.2
SpotLight	65.5	49.4	68.4	59.4
PASTA* (tuned)	79.9	58.5	80.8	71.0
SpotLight* (tuned)	79.9	57.0	87.0	72.1
DIRECTER	81.8	62.0	93.0	76.5

Mitigating Oversteering

Applying DIRECTER's plausibility filter significantly improves other methods.

Inference Efficiency

90% Throughput

0 Extra Data

Fully compatible with FlashAttention

No pre-computation or extra data required

Background

KV Cache Steering

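[Diagram: a row of cached tokens in which the target instruction token's Key is amplified (Key*) while its Value is left unchanged]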

DIRECTER amplifies specific "Key" vectors in the model's memory to force attention on instruction tokens.

Instead of modifying weights, DIRECTER modifies the KV cache of instruction tokens by amplifying their key vectors. This helps the model's attention mechanism focus on the given constraints during generation.
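
The effect of amplifying a key can be seen in a toy single-head attention computation. The sketch below is illustrative only: the scaling factor `alpha`, the function names, and the use of plain Python lists in place of a real KV cache are assumptions, and a real model would apply the scaling per layer and per head.

```python
import math
from typing import List, Sequence

Vector = List[float]


def attention_weights(query: Vector, keys: Sequence[Vector]) -> List[float]:
    """Scaled dot-product attention weights of one query over cached keys."""
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def amplify_keys(
    keys: Sequence[Vector],
    instruction_positions: Sequence[int],
    alpha: float,  # assumed steering factor; alpha > 1 amplifies
) -> List[Vector]:
    """Return a copy of the keys with instruction-token keys scaled by alpha."""
    targets = set(instruction_positions)
    return [
        [alpha * x for x in key] if i in targets else list(key)
        for i, key in enumerate(keys)
    ]


# Toy cache: position 0 holds the instruction token.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.5, 0.0], [0.2, 0.0]]
before = attention_weights(query, keys)
after = attention_weights(query, amplify_keys(keys, [0], alpha=2.0))
```

Scaling a key by `alpha` multiplies its attention logit by the same factor, so the weight on the instruction token rises when the query-key dot product is positive, as it is in this toy example.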

Activation Steering

[Diagram: the Original state vector plus the Steering Vector gives the Final Output]

Think of the model's state as a vector. Steering adds a "Constraint Vector". DIRECTER dynamically adjusts the length of this steering vector based on plausibility, ensuring the final output doesn't deviate too far into incoherence.