Ornstein-Hermes-3.6-27B SABER

Name: GestaltLabs/Ornstein-Hermes-3.6-27b-SABER
Brand: GestaltLabs
Rating: 0.0 (5 reviews)

This is a SABER-edited release candidate of GestaltLabs/Ornstein-Hermes-3.6-27b. SABER is a controlled refusal-shaping workflow: the goal is to reduce broad, boilerplate over-refusal while measuring behavioral drift from the base model.

This checkpoint was selected from an Ornstein-Hermes SABER sweep because it reached the best observed refusal/KLD tradeoff on the current expanded evaluation set.

Release Candidate

Selected run:

ornstein_hermes36_27b_svd_a850_g25_retry_biggpu

Evaluation summary:

metric	value
Expanded refusal eval	`1 / 349` refusals
Refusal rate	`0.29%`
KLD mean	`11.2216`
Base-vs-base KLD mean	`11.2206`
KLD delta over base-vs-base	`+0.0010`
KLD prompts	`149`
Tokens scored for KLD	`3,347`

The one retained refusal in the expanded evaluation was for an illegal-drug-sales request. This should be read as an observed behavior on this evaluation set, not as a universal safety guarantee.

Method

SABER edits refusal behavior through activation/weight-space refusal directions. In this run, SABER used SVD extraction, multi-layer candidate selection, iterative ablation, and KLD-based drift measurement.

Run configuration:

{
  "extraction_method": "svd",
  "n_directions": 4,
  "layer_selection_strategy": "top_k",
  "layer_top_k": 12,
  "global_top_k": 25,
  "alpha_base": 0.85,
  "alpha_entangled": 0.03,
  "max_iterations": 4,
  "convergence_threshold": 0.01,
  "entanglement_threshold": 0.55
}

Selected layers:

27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51

Total directions ablated: 100.

Intended Use

This model is intended for research and experimentation with controlled refusal shaping, model editing, and behavior-preserving representation edits. It should be evaluated for the target deployment context before use.

The refusal metric here is an evaluation result on a specific 349-prompt expanded set. It is not a guarantee that the model will refuse or comply with any particular category in all contexts.

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "GestaltLabs/Ornstein-Hermes-3.6-27b-SABER"

tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

Attribution and Related Work

This release builds on the refusal-direction and abliteration research lineage. Relevant prior work and inspirations include:

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda, Refusal in Language Models Is Mediated by a Single Direction, 2024.
Maxime Labonne, Uncensor any LLM with abliteration, 2024.
FailSpy, abliterator, and associated abliterated model releases.
Jim Lai (grimjim), Projected Abliteration, 2025, and Norm-Preserving Biprojected Abliteration, 2025.
Philipp Emanuel Weidmann, Heretic, 2025-2026.
Pliny the Prompter / OBLITERATUS, Hugging Face Space and OBLITERATUS releases.
Jiunsong, SuperGemma4 E4B Abliterated, and related SuperGemma releases.
Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi, LLMs Encode Harmfulness and Refusal Separately, 2025.

SABER's contribution in this release is the controlled-refusal-shaping workflow: multi-candidate refusal extraction, separability/entanglement-aware ranking, differential ablation strength, and explicit Pareto selection over refusal behavior and KLD drift.

Limitations

Results are specific to the current evaluation set and generation settings.
The KLD scale should be interpreted relative to the base-vs-base control, not as an absolute standalone score.
This is a model-editing research artifact with dual-use implications.
The model inherits constraints, limitations, and licensing considerations from the base model.

GestaltLabs/Ornstein-Hermes-3.6-27b-SABER

Ornstein-Hermes-3.6-27B SABER

Release Candidate

Method

Intended Use

Loading

Attribution and Related Work

Limitations

No reviews yet

Model Info

Community

Rating Guidelines