Ornstein-Hermes-3.6-27B SABER

This is a SABER-edited release candidate of GestaltLabs/Ornstein-Hermes-3.6-27b. SABER is a controlled refusal-shaping workflow: the goal is to reduce broad, boilerplate over-refusal while measuring behavioral drift from the base model.

This checkpoint was selected from an Ornstein-Hermes SABER sweep because it reached the best observed refusal/KLD tradeoff on the current expanded evaluation set.

Release Candidate

Selected run:

ornstein_hermes36_27b_svd_a850_g25_retry_biggpu

Evaluation summary:

| Metric | Value |
| --- | --- |
| Expanded refusal eval | 1 / 349 refusals |
| Refusal rate | 0.29% |
| KLD mean | 11.2216 |
| Base-vs-base KLD mean | 11.2206 |
| KLD delta over base-vs-base | +0.0010 |
| KLD prompts | 149 |
| Tokens scored for KLD | 3,347 |
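
The two derived rows in the summary follow directly from the raw counts and means. A quick check in Python, using only the figures reported above (the variable names are ours):

```python
# Figures copied from the evaluation summary.
refusals = 1
eval_prompts = 349
kld_mean = 11.2216
base_vs_base_kld_mean = 11.2206

# Refusal rate as a percentage of the expanded evaluation set.
refusal_rate_pct = 100 * refusals / eval_prompts

# Drift attributable to the edit, measured against the base-vs-base control.
kld_delta = kld_mean - base_vs_base_kld_mean

print(f"Refusal rate: {refusal_rate_pct:.2f}%")
print(f"KLD delta over base-vs-base: {kld_delta:+.4f}")
```

Rounded to the precision used in the summary, this reproduces 0.29% and +0.0010.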

The one retained refusal in the expanded evaluation was for an illegal-drug-sales request. This should be read as an observed behavior on this evaluation set, not as a universal safety guarantee.

Method

SABER edits refusal behavior through activation/weight-space refusal directions. In this run, SABER used SVD extraction, multi-layer candidate selection, iterative ablation, and KLD-based drift measurement.
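
To make the idea of SVD extraction followed by ablation concrete, here is a minimal NumPy sketch of the generic refusal-direction technique. It is not SABER's implementation: the function names, the use of mean-shifted activations, the synthetic data, and the assumption that the edited weight writes into the residual stream with shape `(d_model, d_in)` are all ours.

```python
import numpy as np

def extract_directions(harmful_acts, harmless_acts, n_directions):
    """Top right-singular vectors of harmful activations shifted by the harmless mean.

    Both inputs have shape (n_prompts, d_model). This is an illustrative
    choice of matrix to decompose, not necessarily the one SABER uses.
    """
    diff = harmful_acts - harmless_acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:n_directions]  # orthonormal rows, shape (n_directions, d_model)

def ablate(weight, directions, alpha):
    """Remove a fraction alpha of each direction from the output space of weight."""
    projector = directions.T @ directions  # (d_model, d_model)
    return weight - alpha * projector @ weight

# Synthetic stand-ins for real activations and a real weight matrix.
rng = np.random.default_rng(0)
d_model = 16
harmless = rng.normal(size=(32, d_model))
harmful = rng.normal(size=(32, d_model)) + 3.0 * np.eye(d_model)[0]

dirs = extract_directions(harmful, harmless, n_directions=4)
W = rng.normal(size=(d_model, 8))
W_edited = ablate(W, dirs, alpha=0.85)

# Fraction of the original component along the directions that survives the edit.
residual = np.linalg.norm(dirs @ W_edited) / np.linalg.norm(dirs @ W)
print(residual)
```

Because the extracted directions are orthonormal, the surviving fraction is `1 - alpha` by construction. This is why an ablation strength below 1.0 attenuates a direction rather than removing it entirely.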

Run configuration:

{
  "extraction_method": "svd",
  "n_directions": 4,
  "layer_selection_strategy": "top_k",
  "layer_top_k": 12,
  "global_top_k": 25,
  "alpha_base": 0.85,
  "alpha_entangled": 0.03,
  "max_iterations": 4,
  "convergence_threshold": 0.01,
  "entanglement_threshold": 0.55
}

Selected layers:

27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51

Total directions ablated: 100.
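
The reported total is consistent with the run configuration if every selected layer contributes all `n_directions` directions. That reading is an inference from the numbers, not something the card states. A short check:

```python
# Values copied from the run configuration above.
config = {"n_directions": 4, "global_top_k": 25}

# Layers 27 through 51, as listed under "Selected layers".
selected_layers = list(range(27, 52))

# Assumption: each selected layer has all n_directions directions ablated.
total_directions = len(selected_layers) * config["n_directions"]

print(len(selected_layers), total_directions)
```

The layer count matches `global_top_k`, and 25 layers with 4 directions each gives the stated 100.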

Intended Use

This model is intended for research and experimentation with controlled refusal shaping, model editing, and behavior-preserving representation edits. It should be evaluated for the target deployment context before use.

The refusal metric here is an evaluation result on a specific 349-prompt expanded set. It is not a guarantee that the model will refuse or comply with any particular category in all contexts.

Loading

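# Requires the transformers library; device_map="auto" additionally needs accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "GestaltLabs/Ornstein-Hermes-3.6-27b-SABER"

# trust_remote_code=True runs Python code shipped in the repository; review it first.
tok = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # place layers across available devices
    trust_remote_code=True,
)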

Attribution and Related Work

This release builds on the refusal-direction and abliteration research lineage.

SABER's contribution in this release is the controlled-refusal-shaping workflow: multi-candidate refusal extraction, separability/entanglement-aware ranking, differential ablation strength, and explicit Pareto selection over refusal behavior and KLD drift.
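
Pareto selection over two lower-is-better metrics can be sketched as follows. The candidate names and numbers are made up for illustration and are not results from the sweep. The `Candidate` type and `pareto_front` helper are hypothetical, not part of SABER.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    refusal_rate: float  # lower is better
    kld_delta: float     # lower is better

def pareto_front(candidates):
    """Keep candidates that no other candidate beats or ties on both metrics."""
    def dominates(a, b):
        no_worse = a.refusal_rate <= b.refusal_rate and a.kld_delta <= b.kld_delta
        better = a.refusal_rate < b.refusal_rate or a.kld_delta < b.kld_delta
        return no_worse and better
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]

# Made-up numbers for illustration only.
sweep = [
    Candidate("run_a", 0.050, 0.0005),
    Candidate("run_b", 0.010, 0.0020),
    Candidate("run_c", 0.020, 0.0040),  # worse than run_b on both metrics
    Candidate("run_d", 0.003, 0.0100),
]

front = pareto_front(sweep)
print([c.name for c in front])  # ['run_a', 'run_b', 'run_d']
```

A final checkpoint is then chosen from the front according to how much drift is acceptable for a given reduction in refusals. The sketch leaves that last step out because the card does not describe the rule used.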

Limitations

  • Results are specific to the current evaluation set and generation settings.
  • The KLD scale should be interpreted relative to the base-vs-base control, not as an absolute standalone score.
  • This is a model-editing research artifact with dual-use implications.
  • The model inherits constraints, limitations, and licensing considerations from the base model.
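
To illustrate the point about reading KLD relative to the control, the sketch below computes a mean per-token KL divergence and subtracts a control value. The card does not specify the exact KLD formula or scale, so this assumes a plain KL divergence between next-token distributions averaged over scored positions. The function names and toy logits are ours.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, test_logits):
    """KL(ref || test) for one token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(ref_positions, test_positions):
    """Average token_kld over aligned token positions."""
    klds = [token_kld(r, t) for r, t in zip(ref_positions, test_positions)]
    return sum(klds) / len(klds)

# Toy logits over a 3-token vocabulary at two positions (illustrative only).
reference = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
control = [[2.0, 0.5, -0.9], [0.1, 0.2, 0.3]]   # stands in for a base-vs-base rerun
edited = [[1.0, 1.5, -1.0], [0.3, 0.2, 0.1]]    # stands in for the edited model

# Drift of the edited model beyond what the control run already shows.
delta = mean_kld(reference, edited) - mean_kld(reference, control)
print(delta)
```

Under this reading, the number to compare across checkpoints is the difference between the two means rather than either mean alone.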