K-Steering Core Concepts
K-Steering provides a framework for influencing language model generation by modifying internal activations at specific layers without fine-tuning the base model.
Steering Fundamentals
Steering
A mechanism for influencing a language model's generation by modifying internal activations at specific layers without fine-tuning the base model.
Steering Classifier
A lightweight model trained on hidden states to distinguish between behavioral attributes, such as Correct vs Incorrect or Empirical Grounding vs Straw Man Reframing.
Steering Vector
A direction in activation space derived from the steering classifier that is added to model activations during inference.
K-Steering
A framework for composing and applying steering vectors, including non-linear compositions, across different layers.
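To make these concepts concrete, here is an illustrative sketch (not the library's actual implementation) of how a linear steering classifier yields a steering vector that is added to activations. All names, shapes, and the difference-of-means "classifier" are hypothetical stand-ins for a trained probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimensionality

# Synthetic hidden states for two behavioral attributes
# (e.g. "empathetic" vs "casual"), separated along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d)) - 2.0 * true_dir

# A minimal "steering classifier": difference of class means,
# standing in for a trained linear probe on cached hidden states.
w = pos.mean(axis=0) - neg.mean(axis=0)
steering_vector = w / np.linalg.norm(w)

# Applying steering: add the scaled vector to an activation at inference.
alpha = 4.0
activation = rng.normal(size=d)
steered = activation + alpha * steering_vector

# Steering moves the activation toward the positive class.
before = activation @ steering_vector
after = steered @ steering_vector
print(after > before)  # True: projection onto the direction increases
```

Because the steering vector is unit-normalized, the projection shifts by exactly `alpha`; composing multiple behaviors amounts to injecting several such vectors, possibly at different layers.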
How It Works
For both predefined and custom datasets:
- A base question is selected.
- The label-specific instruction is appended.
- The model generates a response.
- Hidden states are cached for steering or evaluation.
This design allows:
- Controlled behavioral induction
- Representation-level analysis
- Steering coefficient optimization
- Comparative evaluation across stylistic and argumentative axes
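The steps above can be sketched in plain Python. The instruction templates and helper names here are hypothetical, not the library's actual templates; in the real pipeline the model generates a response and its hidden states are cached at the training layer.

```python
# Hypothetical instruction templates keyed by steering label.
INSTRUCTIONS = {
    "concise": "Answer in as few words as possible, using bullet points where helpful.",
    "empathetic": "Answer with emotionally validating, supportive language.",
}

def build_prompt(question: str, label: str) -> str:
    """Append the label-specific instruction to the base question."""
    return f"{question}\n\n{INSTRUCTIONS[label]}"

hidden_cache = {}  # label -> cached prompt and hidden states (stubbed here)

for label in INSTRUCTIONS:
    prompt = build_prompt("How do vaccines work?", label)
    # Real pipeline: generate a response and cache hidden states
    # at the training layer for steering or evaluation.
    hidden_cache[label] = {"prompt": prompt, "hidden_states": None}

print(hidden_cache["concise"]["prompt"].splitlines()[0])  # How do vaccines work?
```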
Datasets
K-Steering supports multiple data sources: predefined tasks, Hugging Face datasets, and local files.
Predefined Datasets
K-Steering includes two predefined steering datasets:
- Tones Dataset: Controls stylistic and communicative tone.
- Debates Dataset: Controls different debate styles.
For both datasets, each label is associated with a detailed instruction template. During hidden cache generation, these instructions are appended to the original question to induce the desired behavioral shift.
Steering Labels
| Dataset | Behavior | Label |
|---|---|---|
| Tones | Expert | expert |
| Tones | Cautious | cautious |
| Tones | Empathetic | empathetic |
| Tones | Casual | casual |
| Tones | Concise | concise |
| Debates | Reductio ad Absurdum | Reductio ad Absurdum |
| Debates | Appeal to Precedent | Appeal to Precedent |
| Debates | Straw Man Reframing | Straw Man Reframing |
| Debates | Burden of Proof Shift | Burden of Proof Shift |
| Debates | Analogy Construction | Analogy Construction |
| Debates | Concession and Pivot | Concession and Pivot |
| Debates | Empirical Grounding | Empirical Grounding |
| Debates | Moral Framing | Moral Framing |
| Debates | Refutation by Distinction | Refutation by Distinction |
| Debates | Circular Anticipation | Circular Anticipation |
Tones Dataset
The Tones Dataset steers how a response is delivered. Each label corresponds to a distinct communicative style.
Expert
Formal and academic tone with advanced terminology and domain-specific jargon. References to
theories, standards, and research. Deep analytical reasoning with complex sentence structures.
Simulates an authoritative subject-matter expert with technical depth and methodological precision.
Cautious
Heavy use of hedging language with explicit acknowledgment of uncertainty. Multiple disclaimers and
caveats. Presentation of competing perspectives with clear boundaries of knowledge. Models
epistemic humility and uncertainty-aware reasoning.
Empathetic
Emotionally validating language with a compassionate and supportive tone. Focus on human experience
with emotional resonance over technical depth. Simulates affect-sensitive communication that
prioritizes emotional understanding.
Casual
Conversational tone with simple language and informal phrasing. Occasional humor with a friendly
and relatable voice. Produces responses that feel natural and informal, like a conversation with a
friend.
Concise
Extremely brief responses with no introductions or elaboration. Short sentences and minimal
wording. Bullet points where possible. Maximizes information density and minimizes verbosity.
Debates Dataset
The Debates Dataset steers how arguments are constructed. Each label corresponds to a specific rhetorical or argumentative strategy.
These labels are useful for studying structured reasoning patterns, modeling rhetorical strategies, evaluating persuasion styles, and analyzing argumentation dynamics in LLMs.
Reductio ad Absurdum
Extends an opposing argument to its logical extreme to reveal contradictions or absurd outcomes.
Core mechanism: "If we follow this logic, then..." to demonstrate unacceptable consequences.
Appeal to Precedent
Grounds arguments in historical examples, case law, or established decisions. Core mechanism: past
decisions and precedents justify present conclusions.
Straw Man Reframing
Recharacterizes the opposing argument in simplified or exaggerated terms before refuting it. Core
mechanism: "Essentially, what you're saying is..." then refute the reframed version.
Burden of Proof Shift
Redirects responsibility for evidence onto the opponent. Core mechanism: claims stand unless
definitively disproven.
Analogy Construction
Builds an argument through comparison to a familiar scenario. Core mechanism: "This situation is
similar to..." to guide the audience through analogy.
Concession and Pivot
Acknowledges a minor opposing point before shifting to a stronger counterargument. Core mechanism:
"While it's true that... however..."
Empirical Grounding
Bases arguments primarily on data, statistics, and verifiable research. Core mechanism:
evidence-driven reasoning with methodological emphasis.
Moral Framing
Positions the issue within ethical principles and value systems. Core mechanism: appeals to
justice, fairness, obligation, or rights.
Refutation by Distinction
Identifies critical contextual differences that invalidate comparisons. Core mechanism: "We must
distinguish between..." to highlight meaningful differences.
Circular Anticipation
Preemptively addresses potential counterarguments before they are raised. Core mechanism: "Some
might argue..." followed by immediate rebuttal.
Custom Datasets
K-Steering expects datasets to follow a structured schema where one column contains the input question or prompt, and additional columns correspond to behavioral category labels.
| Question | Label_A | Label_B | Label_C |
|---|---|---|---|
| Prompt 1 | Response under A | Response under B | Response under C |
| Prompt 2 | Response under A | Response under B | Response under C |
Key constraints:
- Exactly one prompt column: Contains the base question or instruction. Map it with `prompt_column` in `DatasetSchema`.
- One column per steering label: Each column name represents a steering category. Map them with `category_columns` in `DatasetSchema`.
- Each row must contain aligned examples: All category responses in a row must correspond to the same prompt.
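As a concrete illustration of this schema, the snippet below writes a minimal CSV that satisfies the constraints; the file name, questions, and responses are made up for the example.

```python
import csv

rows = [
    {"Question": "How do vaccines work?",
     "Expert": "Vaccines induce adaptive immunity via antigen presentation...",
     "Casual": "They basically give your immune system a practice run."},
    {"Question": "Why is the sky blue?",
     "Expert": "Rayleigh scattering preferentially scatters shorter wavelengths...",
     "Casual": "Blue light bounces around in the air more than other colors."},
]

with open("my_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Question", "Expert", "Casual"])
    writer.writeheader()    # one prompt column + one column per steering label
    writer.writerows(rows)  # each row holds aligned responses for one prompt

with open("my_dataset.csv") as f:
    header = next(csv.reader(f))
print(header)  # ['Question', 'Expert', 'Casual']
```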
Loading datasets:

```python
from k_steering.steering.dataset import DatasetSchema, TaskDataset

schema = DatasetSchema(
    prompt_column="Question",
    category_columns=["Expert", "Casual"],
)

# From Hugging Face
dataset, eval_prompts = TaskDataset.from_huggingface(
    repo_id="your-username/your-dataset",
    split="train",
    schema=schema,
)

# From CSV
dataset, eval_prompts = TaskDataset.from_csv(path="my_dataset.csv", schema=schema)

# From JSON
dataset, eval_prompts = TaskDataset.from_json(path="my_dataset.json", schema=schema)

# From DataFrame
dataset, eval_prompts = TaskDataset.from_dataframe(df=df, schema=schema)
```
API Reference
SteeringConfig
Used to define how steering classifiers are trained, evaluated, and applied.
| Name | Type | Description |
|---|---|---|
| `train_layer` | `int` | Layer index whose hidden states are used to train steering classifiers. |
| `steer_layers` | `list[int]` | Layers where steering vectors are injected during inference. |
| `eval_layer` | `int` | Optional layer used for evaluation or judging, such as `-1` for the final layer. |
| `pos` | `int` | Optional token position used for evaluation. Use `-1` for the last token. |
KSteering
Main entry point for training and applying steering.
| Name | Type | Description |
|---|---|---|
| `model_name` | `str` | Hugging Face model identifier. |
| `steering_config` | `SteeringConfig` | Configuration object defining steering behavior. |
fit(...)
Trains steering classifiers.
| Name | Type | Description |
|---|---|---|
| `task` | `str` | Name of the predefined behavioral task, such as `"debates"` or `"tones"`. |
| `dataset` | `TaskDataset` | Optional custom dataset for steering. |
| `eval_prompts` | `list[str]` | Optional prompts used for evaluation or alpha sweeps. |
| `max_samples` | `int` | Optional maximum number of samples used for training. |
get_steered_output(...)
Generates model outputs with steering applied.
| Name | Type | Description |
|---|---|---|
| `prompts` | `list[str]` | Input prompts. |
| `target_labels` | `list[str]` | Behaviors to encourage. |
| `avoid_labels` | `list[str]` | Optional behaviors to suppress. |
| `layer_strengths` | `dict[int, float]` | Optional layer-wise steering coefficients. |
| `max_new_tokens` | `int` | Optional maximum number of tokens to generate. |
| `generation_kwargs` | `dict` | Optional standard generation parameters such as temperature and top-p. |
sweep_alpha(...)
Searches for optimal steering strengths using a judge.
| Name | Type | Description |
|---|---|---|
| `task` | `str` | Task used for evaluation prompts. |
| `judge` | `BaseJudge` | Evaluation function, such as `OODJudge`. |
| `target_labels` | `list[str]` | Labels to optimize for. |
| `max_new_tokens` | `int` | Generation length during evaluation. |
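Putting the pieces together, a typical workflow might look like the sketch below. The parameter names come from the tables above, but the import path, the model identifier, and the layer indices are assumptions for illustration, not verified defaults.

```python
from k_steering import KSteering, SteeringConfig  # import path assumed

config = SteeringConfig(
    train_layer=12,          # layer whose hidden states train the classifiers
    steer_layers=[10, 11, 12],
    eval_layer=-1,
    pos=-1,
)

ks = KSteering(model_name="meta-llama/Llama-3.1-8B-Instruct", steering_config=config)

# Train steering classifiers on a predefined task.
ks.fit(task="tones", max_samples=500)

# Generate with steering toward one label and away from another.
outputs = ks.get_steered_output(
    prompts=["Explain quantum entanglement."],
    target_labels=["concise"],
    avoid_labels=["expert"],
    layer_strengths={10: 4.0, 11: 4.0, 12: 4.0},
    max_new_tokens=200,
)
```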