DIRECTER is a dynamic steering method that mitigates oversteering in LLMs via a plausibility-guided decoding loop that rejects implausible outputs and adaptively modulates steering strength, without retraining.
Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they risk oversteering: excessive emphasis on the instruction that degrades task accuracy and overall text quality.
To address this, we introduce DIRECTER, a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring any additional data. DIRECTER couples steering with a plausibility-guided decoding loop that adaptively adjusts steering strength at each step by comparing the steered output distribution to the original one. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations.
Extensive evaluations show that DIRECTER significantly enhances instruction-following across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the usual trade-offs in generation quality or task fidelity. Dynamic, plausibility-guided control during activation steering also shows promise as a general mechanism for mitigating oversteering, and it is compatible with existing baselines.
While instruction tuning has advanced LLMs, they still struggle with complex constraints. Activation steering offers a training-free solution by manipulating internal states, but it suffers from a critical flaw: oversteering.
Static steering often forces the model to fixate on instructions at the expense of coherence or task fidelity. DIRECTER addresses this with a dynamic rejection mechanism that acts as a safety valve, applying steering only when the result remains plausible to the underlying model.
"Write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response."
"Verily I shall heed thy command. This missive will contain no such marks..."
✗ Ignores the actual task (itinerary)
"Fair traveler thou seekest to embark on a journey to Japan's mystic shores..."
✓ Follows constraints & fulfills task
Unlike static steering, DIRECTER creates a feedback loop during inference. It dynamically rejects oversteering by checking whether the steered output is plausible under the base model.
Steering All Layers → Implausible Output
Reduced Strength → Plausible Output
First, a raw forward pass is performed to obtain the unsteered distribution $p_t$. If the model is already confident (i.e., a large probability gap between the top-2 tokens), steering is skipped for efficiency.
If steering is applied, the Plausibility Gate compares the steered distribution $\tilde{p}_t$ against the raw distribution $p_t$. If the deviation is too large, the token is rejected.
Steering strength is then reduced by removing layers from the steered set. The loop repeats until a plausible token is found, balancing instruction following and coherence.
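The loop above can be sketched as follows. This is a minimal NumPy illustration, not the official implementation: the confidence gap, the KL-divergence plausibility criterion, and the thresholds `conf_gap` and `tau` are all assumptions chosen for clarity.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def plausibility_gate(p_raw, p_steered, tau):
    # Accept the steered distribution only if it stays close to the raw
    # model; KL divergence is one possible deviation measure.
    kl = float(np.sum(p_steered * np.log(p_steered / p_raw)))
    return kl <= tau

def decode_step(raw_logits, steered_logits_by_level, conf_gap=0.5, tau=5.0):
    """One decoding step: confidence skip, plausibility gate, and
    progressive weakening. `steered_logits_by_level` is ordered from
    strongest steering (all layers) to weakest."""
    p_raw = softmax(raw_logits)
    top2 = np.sort(p_raw)[-2:]
    # Confidence skip: if the raw model is already confident, do not steer.
    if top2[1] - top2[0] > conf_gap:
        return int(np.argmax(p_raw)), "skipped"
    # Progressively weaken steering until the output is plausible.
    for level, steered_logits in enumerate(steered_logits_by_level):
        p_steered = softmax(steered_logits)
        if plausibility_gate(p_raw, p_steered, tau):
            return int(np.argmax(p_steered)), f"steered@level{level}"
    # Fall back to the raw distribution if every level is rejected.
    return int(np.argmax(p_raw)), "fallback"
```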
Two key mechanisms further improve the effectiveness and efficiency of DIRECTER.
DIRECTER performs a one-time layerwise Sensitivity Analysis.
It measures the distributional shift caused by steering each layer, then reorders them to prioritize high-leverage layers for steering.
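A minimal sketch of this ranking step, under the assumption that the shift is measured by KL divergence between per-layer steered distributions and the raw one (the paper's exact measure may differ):

```python
import numpy as np

def layer_sensitivity_order(raw_logits, steered_logits_per_layer):
    """One-time sensitivity analysis: rank layers by how much steering
    each one alone shifts the output distribution."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p_raw = softmax(raw_logits)
    scores = []
    for logits in steered_logits_per_layer:
        p = softmax(logits)
        scores.append(float(np.sum(p * np.log(p / p_raw))))
    # Highest-leverage layers first; weakening later removes layers
    # from the back of this ordering.
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```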
If the raw model is highly confident (i.e., large gap between Top-1 and Top-2 probabilities), DIRECTER skips the steering pass entirely to save compute.
DIRECTER outperforms baselines while maintaining generation quality.
| Method | IFEval | LIFBench | GSM8K-F | AVG |
|---|---|---|---|---|
| Zero-shot | 77.5 | 57.6 | 80.9 | 70.0 |
| PASTA | 71.1 | 48.8 | 73.6 | 62.2 |
| SpotLight | 65.5 | 49.4 | 68.4 | 59.4 |
| PASTA* (tuned) | 79.9 | 58.5 | 80.8 | 71.0 |
| SpotLight* (tuned) | 79.9 | 57.0 | 87.0 | 72.1 |
| DIRECTER | 81.8 | 62.0 | 93.0 | 76.5 |
Applying DIRECTER's plausibility filter significantly improves other methods.
✓ Fully compatible with FlashAttention
✓ No pre-computation or extra data required
DIRECTER amplifies specific key vectors in the model's memory to focus attention on instruction tokens.
Instead of modifying weights, DIRECTER modifies the KV cache of instruction tokens by amplifying their key vectors, helping the model's attention mechanism focus on the given constraints during generation.
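The key-amplification step can be illustrated as below. This is a sketch only: the scale factor `alpha`, the single-head layout, and the boolean instruction mask are illustrative assumptions, not the method's exact parameterization.

```python
import numpy as np

def amplify_instruction_keys(key_cache, instruction_mask, alpha=1.5):
    """Scale the cached key vectors of instruction tokens by alpha so
    their attention scores grow relative to other positions.

    key_cache:        (seq_len, head_dim) cached keys for one head
    instruction_mask: boolean (seq_len,) marking instruction tokens
    """
    steered = key_cache.copy()          # leave the raw cache untouched
    steered[instruction_mask] *= alpha  # amplify instruction keys only
    return steered
```

Because this only rescales entries of the existing KV cache, it composes with fused attention kernels rather than requiring a custom attention implementation.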
Think of the model's state as a vector. Steering adds a "constraint vector", and DIRECTER dynamically adjusts the length of this steering vector based on plausibility, ensuring the final output does not drift into incoherence.