Preprint

Enhancing Instruction Following via
Activation Steering with Dynamic Rejection

Minjae Kang
Independent Researcher
Jaehyung Kim
Yonsei University

TL;DR

DIRECTER is a dynamic steering method that mitigates oversteering in LLMs through a plausibility-guided decoding loop, which rejects implausible outputs and adaptively modulates steering strength without retraining.

Abstract

Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they risk oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality.

To address this, we introduce DIRECTER, a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring any additional dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations.

Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.

MOTIVATION

The Oversteering Problem

While instruction tuning has advanced LLMs, they still struggle with complex constraints. Activation steering offers a training-free solution by manipulating internal states, but it suffers from a critical flaw: oversteering.

Static steering often forces the model to fixate on instructions at the expense of coherence or task fidelity. DIRECTER addresses this by introducing a dynamic rejection mechanism that acts as a safety valve, ensuring steering is applied only when the steered output remains plausible to the underlying model.

Impact of Steering
User Prompt

"Write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response."

Static Steering (Oversteering): Fail

"Verily I shall heed thy command. This missive will contain no such marks..."

Result: ignores the actual task (the itinerary)

DIRECTER: Success

"Fair traveler thou seekest to embark on a journey to Japan's mystic shores..."

Result: follows the constraint and fulfills the task

Mechanism

Plausibility-Guided Decoding

Unlike static steering, DIRECTER creates a feedback loop during inference. It dynamically rejects oversteered tokens by checking whether the steered output is plausible under the base model.

[Interactive diagram: Prompt, KV Cache Steering, Plausibility Check, Generated token ("Mystic")]
Decoding Cycle

1. Raw Pass & Gating

First, a raw (unsteered) forward pass produces the next-token distribution $p_t$. If the model is confident (i.e., there is a large probability gap between the top two tokens), steering is skipped for efficiency.

2. Detect Oversteering

If steering is applied, the Plausibility Gate compares the steered output distribution $\tilde{p}_t$ against the raw distribution $p_t$. If the deviation is too high, the steered token is rejected.

3. Dynamic Adjustment

Steering strength is reduced by removing layers from the steered set. The loop repeats until a plausible token is found, balancing instruction following against coherence.
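The three steps above can be sketched as a single decoding step. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `decode_step`, `steered_probs_fn`, `gap_threshold`, and `alpha` are hypothetical, the plausibility test (the steered token must have raw probability of at least `alpha` times the raw maximum) is one possible criterion, and dropping layers from the end of the ranked list is an assumed order. The page only states that the steered distribution is compared with the raw one and that layers are removed on rejection.

```python
from typing import Callable, List, Sequence


def argmax(probs: Sequence[float]) -> int:
    """Index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top2_gap(probs: Sequence[float]) -> float:
    """Difference between the largest and second-largest probabilities."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def is_plausible(raw_probs: Sequence[float], token: int, alpha: float) -> bool:
    # ASSUMPTION: a token is plausible if the raw model gives it at least
    # alpha times the probability of the raw model's own top token.
    return raw_probs[token] >= alpha * max(raw_probs)


def decode_step(
    raw_probs: Sequence[float],
    steered_probs_fn: Callable[[List[int]], Sequence[float]],
    ranked_layers: List[int],
    gap_threshold: float = 0.5,  # hypothetical confidence threshold
    alpha: float = 0.5,          # hypothetical plausibility threshold
) -> int:
    # Step 1: raw pass and gating. A confident raw model skips steering.
    if top2_gap(raw_probs) >= gap_threshold:
        return argmax(raw_probs)

    # Steps 2 and 3: steer, check plausibility, weaken steering on rejection.
    active = list(ranked_layers)
    while active:
        token = argmax(steered_probs_fn(active))
        if is_plausible(raw_probs, token, alpha):
            return token
        active.pop()  # ASSUMPTION: drop the lowest-ranked layer first

    # No steered layer set produced a plausible token: use the raw token.
    return argmax(raw_probs)


# Toy usage: steering with more than two layers pushes hard toward token 3,
# which the raw model considers very unlikely.
raw = [0.40, 0.35, 0.20, 0.05]


def toy_steered(active_layers: List[int]) -> Sequence[float]:
    if len(active_layers) > 2:
        return [0.1, 0.2, 0.1, 0.6]
    return [0.2, 0.5, 0.2, 0.1]


chosen = decode_step(raw, toy_steered, ranked_layers=[5, 2, 7, 1])
```

In the toy run, token 3 is rejected twice while four and then three layers are steered; with two layers left the steered model prefers token 1, which the raw model also finds plausible, so it is accepted.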


Method Breakdown

Two key mechanisms further improve the effectiveness and efficiency of DIRECTER.

Layer Sensitivity Ranking

DIRECTER performs a one-time, layer-wise sensitivity analysis. It measures the distributional shift caused by steering each layer, then reorders the layers to prioritize high-leverage ones for steering.
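A sensitivity ranking of this kind could be sketched as follows. The page does not name the metric used to measure distributional shift, so the KL divergence from the raw distribution is an assumption here, as are the function names `kl_divergence` and `rank_layers` and the toy distributions.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rank_layers(
    raw_probs: Sequence[float],
    steered_by_layer: Dict[int, Sequence[float]],
) -> List[int]:
    """Order layers from largest to smallest shift when steered alone.

    ASSUMPTION: shift is measured as KL(steered || raw); the source only
    says that a distributional shift is measured per layer.
    """
    shift = {
        layer: kl_divergence(dist, raw_probs)
        for layer, dist in steered_by_layer.items()
    }
    return sorted(shift, key=lambda layer: shift[layer], reverse=True)


# Toy usage: output distribution obtained by steering each layer in isolation.
raw = [0.5, 0.3, 0.2]
steered_by_layer = {
    0: [0.5, 0.3, 0.2],    # no effect
    1: [0.2, 0.3, 0.5],    # large shift
    2: [0.45, 0.3, 0.25],  # small shift
}
ranking = rank_layers(raw, steered_by_layer)
```

Because the analysis is run once, the resulting order can be cached and reused at every decoding step.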


Efficient Skipping Gate

If the raw model is highly confident (i.e., there is a large gap between the top-1 and top-2 probabilities), DIRECTER skips the steering pass entirely to save compute.



Example: Top-1 at 90% and Top-2 at 5%. The model is confident, so steering is skipped.
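As a small worked illustration of the gate, the check below converts logits to probabilities and compares the top-two gap with a threshold. The threshold value of 0.5 and the name `should_skip_steering` are assumptions; the page does not state how large the gap must be.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_skip_steering(logits: Sequence[float], gap_threshold: float = 0.5) -> bool:
    """Return True when the raw model is confident enough to skip steering.

    ASSUMPTION: confidence is the top-1 minus top-2 probability, compared
    against a fixed, hypothetical threshold.
    """
    top1, top2 = sorted(softmax(logits), reverse=True)[:2]
    return (top1 - top2) >= gap_threshold


# Logits chosen so the probabilities are 0.90, 0.05, 0.05: gap of 0.85.
confident_logits = [math.log(0.90), math.log(0.05), math.log(0.05)]
# Two near-tied candidates: the gap is small, so steering would run.
uncertain_logits = [1.0, 0.9, 0.2]

skip_confident = should_skip_steering(confident_logits)
skip_uncertain = should_skip_steering(uncertain_logits)
```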

Experimental Results

DIRECTER outperforms baselines while maintaining generation quality.

Benchmark Performance (Accuracy %)

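| Method             | IFEval | LIFBench | GSM8K-F | AVG  |
|--------------------|--------|----------|---------|------|
| Zero-shot          | 77.5   | 57.6     | 80.9    | 70.0 |
| PASTA              | 71.1   | 48.8     | 73.6    | 62.2 |
| SpotLight          | 65.5   | 49.4     | 68.4    | 59.4 |
| PASTA* (tuned)     | 79.9   | 58.5     | 80.8    | 71.0 |
| SpotLight* (tuned) | 79.9   | 57.0     | 87.0    | 72.1 |
| DIRECTER           | 81.8   | 62.0     | 93.0    | 76.5 |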

Mitigating Oversteering

Applying DIRECTER's plausibility filter significantly improves other methods.


Inference Efficiency

90% Throughput
0 Extra Data

Fully compatible with FlashAttention

No pre-computation or extra data required

Background


KV Cache Steering

[Diagram: KV cache entries (K1, V1), ..., (Kt, Vt), with the key of the target instruction token amplified (Key*)]

DIRECTER amplifies specific "Key" vectors in the model's memory to force attention on instruction tokens.

Instead of modifying weights, DIRECTER modifies the KV cache entries of instruction tokens by amplifying their key vectors. This helps the model's attention mechanism focus on the given constraints during generation.
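The effect of amplifying a key can be illustrated with a single attention head. This is a simplified sketch, assuming that "amplifying" means multiplying the cached key vector by a scalar `scale`; the names `amplify_keys` and `attention_weights` and the toy vectors are hypothetical and do not come from the source.

```python
import math
from typing import List, Sequence, Set


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention weights of one query over cached keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)


def amplify_keys(
    keys: Sequence[Sequence[float]],
    instruction_positions: Set[int],
    scale: float,
) -> List[List[float]]:
    """Return a copy of the key cache with instruction-token keys scaled.

    ASSUMPTION: amplification is a plain scalar multiplication of the key.
    """
    return [
        [scale * x for x in key] if i in instruction_positions else list(key)
        for i, key in enumerate(keys)
    ]


# Toy cache with three tokens; position 1 is the instruction token.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

before = attention_weights(query, keys)
after = attention_weights(query, amplify_keys(keys, {1}, scale=3.0))
```

Scaling a key multiplies its attention logit by the same factor, so in this sketch the attention on the instruction token rises only when its query-key dot product is positive, as it is in the toy example.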


Activation Steering

[Diagram: Original state + Steering Vector = Final Output]

Think of the model's state as a vector. Steering adds a "Constraint Vector". DIRECTER dynamically adjusts the length of this steering vector based on plausibility, ensuring the final output doesn't deviate too far into incoherence.
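The vector picture above can be written as a single equation. The symbols are introduced here only for illustration and are not taken from the source: $h$ is the original state, $v$ the steering ("constraint") direction, and $\lambda$ the steering strength.

$$h' = h + \lambda\, v, \qquad 0 \le \lambda \le \lambda_{\max}$$

Under this reading, static steering fixes $\lambda = \lambda_{\max}$, while a dynamic scheme shrinks $\lambda$ whenever the steered token is judged implausible, with $\lambda = 0$ recovering the unsteered model. This is the analogy only; as described earlier on this page, DIRECTER realizes the change in strength by altering the set of steered layers.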