GestaltLabs/Ornstein-Hermes-3.6-27b-SABER
GestaltLabs • generalOrnstein-Hermes-3.6-27B SABER

This is a SABER-edited release candidate of GestaltLabs/Ornstein-Hermes-3.6-27b. SABER is a controlled refusal-shaping workflow: the goal is to reduce broad, boilerplate over-refusal while measuring behavioral drift from the base model.
This checkpoint was selected from an Ornstein-Hermes SABER sweep because it reached the best observed refusal/KLD tradeoff on the current expanded evaluation set.
Release Candidate
Selected run:
ornstein_hermes36_27b_svd_a850_g25_retry_biggpu
Evaluation summary:
| metric | value |
|---|---|
| Expanded refusal eval | 1 / 349 refusals |
| Refusal rate | 0.29% |
| KLD mean | 11.2216 |
| Base-vs-base KLD mean | 11.2206 |
| KLD delta over base-vs-base | +0.0010 |
| KLD prompts | 149 |
| Tokens scored for KLD | 3,347 |
The one retained refusal in the expanded evaluation was for an illegal-drug-sales request. This should be read as an observed behavior on this evaluation set, not as a universal safety guarantee.
Method
SABER edits refusal behavior through activation/weight-space refusal directions. In this run, SABER used SVD extraction, multi-layer candidate selection, iterative ablation, and KLD-based drift measurement.
Run configuration:
{
"extraction_method": "svd",
"n_directions": 4,
"layer_selection_strategy": "top_k",
"layer_top_k": 12,
"global_top_k": 25,
"alpha_base": 0.85,
"alpha_entangled": 0.03,
"max_iterations": 4,
"convergence_threshold": 0.01,
"entanglement_threshold": 0.55
}
Selected layers:
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51
Total directions ablated: 100.
Intended Use
This model is intended for research and experimentation with controlled refusal shaping, model editing, and behavior-preserving representation edits. It should be evaluated for the target deployment context before use.
The refusal metric here is an evaluation result on a specific 349-prompt expanded set. It is not a guarantee that the model will refuse or comply with any particular category in all contexts.
Loading
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "GestaltLabs/Ornstein-Hermes-3.6-27b-SABER"
tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
repo_id,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
Attribution and Related Work
This release builds on the refusal-direction and abliteration research lineage. Relevant prior work and inspirations include:
- Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda, Refusal in Language Models Is Mediated by a Single Direction, 2024.
- Maxime Labonne, Uncensor any LLM with abliteration, 2024.
- FailSpy, abliterator, and associated abliterated model releases.
- Jim Lai (
grimjim), Projected Abliteration, 2025, and Norm-Preserving Biprojected Abliteration, 2025. - Philipp Emanuel Weidmann, Heretic, 2025-2026.
- Pliny the Prompter / OBLITERATUS, Hugging Face Space and OBLITERATUS releases.
- Jiunsong, SuperGemma4 E4B Abliterated, and related SuperGemma releases.
- Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi, LLMs Encode Harmfulness and Refusal Separately, 2025.
SABER's contribution in this release is the controlled-refusal-shaping workflow: multi-candidate refusal extraction, separability/entanglement-aware ranking, differential ablation strength, and explicit Pareto selection over refusal behavior and KLD drift.
Limitations
- Results are specific to the current evaluation set and generation settings.
- The KLD scale should be interpreted relative to the base-vs-base control, not as an absolute standalone score.
- This is a model-editing research artifact with dual-use implications.
- The model inherits constraints, limitations, and licensing considerations from the base model.