K-Steering Core Concepts
K-Steering provides a framework for influencing language model generation by modifying internal activations at specific layers without fine-tuning the base model.
Steering Fundamentals
Steering
A mechanism for influencing a language model's generation by modifying internal activations at specific layers without fine-tuning the base model.
Steering Classifier
A lightweight model trained on hidden states to distinguish between behavioral attributes, such as Correct vs Incorrect or Empirical Grounding vs Straw Man Reframing.
Steering Vector
A direction in activation space derived from the steering classifier that is added to model activations during inference.
K-Steering
A framework for composing and applying steering vectors, including non-linear compositions, across different layers.
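To make these concepts concrete, here is an illustrative sketch (not the library's actual implementation) of how a linear steering classifier yields a steering vector that is added to activations. All names, shapes, and the difference-of-means "classifier" are hypothetical stand-ins for a trained probe.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimensionality

# Synthetic hidden states for two behavioral attributes
# (e.g. "empathetic" vs "casual"), separated along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d)) - 2.0 * true_dir

# A minimal "steering classifier": difference of class means,
# standing in for a trained linear probe on cached hidden states.
w = pos.mean(axis=0) - neg.mean(axis=0)
steering_vector = w / np.linalg.norm(w)

# Applying steering: add the scaled vector to an activation at inference.
alpha = 4.0
activation = rng.normal(size=d)
steered = activation + alpha * steering_vector

# Steering moves the activation toward the positive class.
before = activation @ steering_vector
after = steered @ steering_vector
print(after > before)  # True: projection onto the direction increases
```

Because the steering vector is unit-normalized, the projection shifts by exactly `alpha`; composing multiple behaviors amounts to injecting several such vectors, possibly at different layers.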
How It Works
For both predefined and custom datasets:
- A base question is selected.
- The label-specific instruction is appended.
- The model generates a response.
- Hidden states are cached for steering or evaluation.
This design allows:
- Controlled behavioral induction
- Representation-level analysis
- Steering coefficient optimization
- Comparative evaluation across stylistic and argumentative axes
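The steps above can be sketched in plain Python. The instruction templates and helper names here are hypothetical, not the library's actual templates; in the real pipeline the model generates a response and its hidden states are cached at the training layer.

```python
# Hypothetical instruction templates keyed by steering label.
INSTRUCTIONS = {
    "concise": "Answer in as few words as possible, using bullet points where helpful.",
    "empathetic": "Answer with emotionally validating, supportive language.",
}

def build_prompt(question: str, label: str) -> str:
    """Append the label-specific instruction to the base question."""
    return f"{question}\n\n{INSTRUCTIONS[label]}"

hidden_cache = {}  # label -> cached prompt and hidden states (stubbed here)

for label in INSTRUCTIONS:
    prompt = build_prompt("How do vaccines work?", label)
    # Real pipeline: generate a response and cache hidden states
    # at the training layer for steering or evaluation.
    hidden_cache[label] = {"prompt": prompt, "hidden_states": None}

print(hidden_cache["concise"]["prompt"].splitlines()[0])  # How do vaccines work?
```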
Datasets
K-Steering supports multiple data sources: predefined tasks, Hugging Face datasets, and local files.
Predefined Datasets
K-Steering includes two predefined steering datasets:
- Tones Dataset: Controls stylistic and communicative tone.
- Debates Dataset: Controls different debate styles.
For both datasets, each label is associated with a detailed instruction template. During hidden cache generation, these instructions are appended to the original question to induce the desired behavioral shift.
Steering Labels
| Dataset | Behavior | Label |
|---|---|---|
| Tones | Expert | expert |
| Tones | Cautious | cautious |
| Tones | Empathetic | empathetic |
| Tones | Casual | casual |
| Tones | Concise | concise |
| Debates | Reductio ad Absurdum | Reductio ad Absurdum |
| Debates | Appeal to Precedent | Appeal to Precedent |
| Debates | Straw Man Reframing | Straw Man Reframing |
| Debates | Burden of Proof Shift | Burden of Proof Shift |
| Debates | Analogy Construction | Analogy Construction |
| Debates | Concession and Pivot | Concession and Pivot |
| Debates | Empirical Grounding | Empirical Grounding |
| Debates | Moral Framing | Moral Framing |
| Debates | Refutation by Distinction | Refutation by Distinction |
| Debates | Circular Anticipation | Circular Anticipation |
Tones Dataset
The Tones Dataset steers how a response is delivered. Each label corresponds to a distinct communicative style.
Expert
Formal and academic tone with advanced terminology and domain-specific jargon. References to
theories, standards, and research. Deep analytical reasoning with complex sentence structures.
Simulates an authoritative subject-matter expert with technical depth and methodological precision.
Cautious
Heavy use of hedging language with explicit acknowledgment of uncertainty. Multiple disclaimers and
caveats. Presentation of competing perspectives with clear boundaries of knowledge. Models
epistemic humility and uncertainty-aware reasoning.
Empathetic
Emotionally validating language with a compassionate and supportive tone. Focus on human experience
with emotional resonance over technical depth. Simulates affect-sensitive communication that
prioritizes emotional understanding.
Casual
Conversational tone with simple language and informal phrasing. Occasional humor with a friendly
and relatable voice. Produces responses that feel natural and informal, like a conversation with a
friend.
Concise
Extremely brief responses with no introductions or elaboration. Short sentences and minimal
wording. Bullet points where possible. Maximizes information density and minimizes verbosity.
Debates Dataset
The Debates Dataset steers how arguments are constructed. Each label corresponds to a specific rhetorical or argumentative strategy.
These labels are useful for studying structured reasoning patterns, modeling rhetorical strategies, evaluating persuasion styles, and analyzing argumentation dynamics in LLMs.
Reductio ad Absurdum
Extends an opposing argument to its logical extreme to reveal contradictions or absurd outcomes.
Core mechanism: "If we follow this logic, then..." to demonstrate unacceptable consequences.
Appeal to Precedent
Grounds arguments in historical examples, case law, or established decisions. Core mechanism: past
decisions and precedents justify present conclusions.
Straw Man Reframing
Recharacterizes the opposing argument in simplified or exaggerated terms before refuting it. Core
mechanism: "Essentially, what you're saying is..." then refute the reframed version.
Burden of Proof Shift
Redirects responsibility for evidence onto the opponent. Core mechanism: claims stand unless
definitively disproven.
Analogy Construction
Builds an argument through comparison to a familiar scenario. Core mechanism: "This situation is
similar to..." to guide the audience through analogy.
Concession and Pivot
Acknowledges a minor opposing point before shifting to a stronger counterargument. Core mechanism:
"While it's true that... however..."
Empirical Grounding
Bases arguments primarily on data, statistics, and verifiable research. Core mechanism:
evidence-driven reasoning with methodological emphasis.
Moral Framing
Positions the issue within ethical principles and value systems. Core mechanism: appeals to
justice, fairness, obligation, or rights.
Refutation by Distinction
Identifies critical contextual differences that invalidate comparisons. Core mechanism: "We must
distinguish between..." to highlight meaningful differences.
Circular Anticipation
Preemptively addresses potential counterarguments before they are raised. Core mechanism: "Some
might argue..." followed by immediate rebuttal.
Custom Datasets
K-Steering expects datasets to follow a structured schema where one column contains the input question or prompt, and additional columns correspond to behavioral category labels.
| Question | Label_A | Label_B | Label_C |
|---|---|---|---|
| Prompt 1 | Response under A | Response under B | Response under C |
| Prompt 2 | Response under A | Response under B | Response under C |
Key constraints:
- Exactly one prompt column: Contains the base question or instruction. Map it with `prompt_column` in `DatasetSchema`.
- One column per steering label: Each column name represents a steering category. Map them with `category_columns` in `DatasetSchema`.
- Each row must contain aligned examples: All category responses in a row must correspond to the same prompt.
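As a concrete illustration of this schema, the snippet below writes a minimal CSV that satisfies the constraints; the file name, questions, and responses are made up for the example.

```python
import csv

rows = [
    {"Question": "How do vaccines work?",
     "Expert": "Vaccines induce adaptive immunity via antigen presentation...",
     "Casual": "They basically give your immune system a practice run."},
    {"Question": "Why is the sky blue?",
     "Expert": "Rayleigh scattering preferentially scatters shorter wavelengths...",
     "Casual": "Blue light bounces around in the air more than other colors."},
]

with open("my_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Question", "Expert", "Casual"])
    writer.writeheader()    # one prompt column + one column per steering label
    writer.writerows(rows)  # each row holds aligned responses for one prompt

with open("my_dataset.csv") as f:
    header = next(csv.reader(f))
print(header)  # ['Question', 'Expert', 'Casual']
```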
Loading datasets:

```python
from k_steering.steering.dataset import DatasetSchema, TaskDataset

schema = DatasetSchema(
    prompt_column="Question",
    category_columns=["Expert", "Casual"],
)

# From Hugging Face
dataset, eval_prompts = TaskDataset.from_huggingface(
    repo_id="your-username/your-dataset",
    split="train",
    schema=schema,
)

# From CSV
dataset, eval_prompts = TaskDataset.from_csv(path="my_dataset.csv", schema=schema)

# From JSON
dataset, eval_prompts = TaskDataset.from_json(path="my_dataset.json", schema=schema)

# From DataFrame
dataset, eval_prompts = TaskDataset.from_dataframe(df=df, schema=schema)
```
API Reference
SteeringConfig
Used to define how steering classifiers are trained, evaluated, and applied.
| Name | Type | Description |
|---|---|---|
| `train_layer` | `int` | Layer index whose hidden states are used to train steering classifiers. |
| `steer_layers` | `list[int]` | Layers where steering vectors are injected during inference. |
| `eval_layer` | `int` | Optional layer used for evaluation or judging, such as `-1` for the final layer. |
| `pos` | `int` | Optional token position used for evaluation. Use `-1` for the last token. |
KSteering
Main entry point for training and applying steering.
| Name | Type | Description |
|---|---|---|
| `model_name` | `str` | Hugging Face model identifier. |
| `steering_config` | `SteeringConfig` | Configuration object defining steering behavior. |
fit(...)
Trains steering classifiers.
| Name | Type | Description |
|---|---|---|
| `task` | `str` | Name of the predefined behavioral task, such as `"debates"` or `"tones"`. |
| `dataset` | `TaskDataset` | Optional custom dataset for steering. |
| `eval_prompts` | `list[str]` | Optional prompts used for evaluation or alpha sweeps. |
| `max_samples` | `int` | Optional maximum number of samples used for training. |
get_steered_output(...)
Generates model outputs with steering applied.
| Name | Type | Description |
|---|---|---|
| `prompts` | `list[str]` | Input prompts. |
| `target_labels` | `list[str]` | Behaviors to encourage. |
| `avoid_labels` | `list[str]` | Optional behaviors to suppress. |
| `layer_strengths` | `dict[int, float]` | Optional layer-wise steering coefficients. |
| `max_new_tokens` | `int` | Optional maximum number of tokens to generate. |
| `generation_kwargs` | `dict` | Optional standard generation parameters such as temperature and top-p. |
sweep_alpha(...)
Searches for optimal steering strengths using a judge.
| Name | Type | Description |
|---|---|---|
| `task` | `str` | Task used for evaluation prompts. |
| `judge` | `BaseJudge` | Evaluation function, such as `OODJudge`. |
| `target_labels` | `list[str]` | Labels to optimize for. |
| `max_new_tokens` | `int` | Generation length during evaluation. |
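Putting the pieces together, a typical workflow might look like the sketch below. The parameter names come from the tables above, but the import path, the model identifier, and the layer indices are assumptions for illustration, not verified defaults.

```python
from k_steering import KSteering, SteeringConfig  # import path assumed

config = SteeringConfig(
    train_layer=12,          # layer whose hidden states train the classifiers
    steer_layers=[10, 11, 12],
    eval_layer=-1,
    pos=-1,
)

ks = KSteering(model_name="meta-llama/Llama-3.1-8B-Instruct", steering_config=config)

# Train steering classifiers on a predefined task.
ks.fit(task="tones", max_samples=500)

# Generate with steering toward one label and away from another.
outputs = ks.get_steered_output(
    prompts=["Explain quantum entanglement."],
    target_labels=["concise"],
    avoid_labels=["expert"],
    layer_strengths={10: 4.0, 11: 4.0, 12: 4.0},
    max_new_tokens=200,
)
```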