chopratejas/kompress-base

Kompress: ModernBERT Token Compressor for LLM Context Windows

Kompress compresses text in LLM context windows so agents can do more with less. It's a drop-in replacement for LLMLingua-2 that's higher quality and 2.3x faster.

Results

| Model | Quality | Latency | Size | Params |
|---|---|---|---|---|
| kompress-base | 7.9/10 | 84ms (MPS) | 600MB | 150M |
| kompress-small | 7.4/10 | 13-29ms (ONNX) | 279MB | 70M |
| LLMLingua-2 | 5.9/10 | 117ms | 710MB | 179M |

Quality on Real Agent Data (Claude-judged)

| Eval Set | kompress-base | kompress-small | LLMLingua-2 |
|---|---|---|---|
| Unstructured NL text | 7.9/10 | 7.4/10 | 5.9/10 |
| Claude Code sessions | 7.3/10 | 7.4/10 | 6.2/10 |

Quality scores are judged by Claude Sonnet 4.6: "Can an LLM fully understand and act on the compressed version?" (1-10 scale).

How It Works

Kompress is a dual-head ModernBERT model trained to classify each token as keep or discard:

  • Token head: Binary classifier (keep/discard per token via argmax)
  • Span head: 1D CNN that identifies important regions, boosts borderline tokens in critical spans

The model decides how much to compress based on content density — no fixed compression ratio.
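A minimal sketch of how the two heads might combine at inference time (the threshold and boost values are hypothetical, and this is not the project's actual inference code):

```python
def select_tokens(keep_probs, span_scores, boost=0.15, threshold=0.5):
    """Combine the token head (per-token keep probability) with the
    span head (per-token span importance) into keep/discard decisions.

    keep_probs  -- P(keep) from the token head, one value per token
    span_scores -- sigmoid outputs from the span head, one per token
    """
    kept = []
    for i, (p, s) in enumerate(zip(keep_probs, span_scores)):
        # Boost borderline tokens that fall inside an important span;
        # tokens already above threshold are unaffected by the boost.
        if s > 0.5:
            p = min(1.0, p + boost)
        # argmax over {keep, discard} is equivalent to p >= 0.5.
        if p >= threshold:
            kept.append(i)
    return kept
```

Because the decision is made per token, the fraction kept rises and falls with content density instead of targeting a fixed ratio.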

Example

ORIGINAL (98 words):
After investigating the memory leak, I traced it to the event listener
registration in the WebSocket handler. Every time a client connects, we
register a new listener on the global event bus, but when the client
disconnects, the cleanup function only removes the WebSocket connection
from the pool — it doesn't unregister the event listener. Over time,
these orphaned listeners accumulate and each one holds a reference to
the connection's closure, which in turn holds the entire request context.
The fix is straightforward: store the listener reference at connection
time and explicitly remove it in the disconnect handler.

COMPRESSED (59 words, 60% kept):
investigating memory leak, traced event listener registration WebSocket
handler. Every time client connects, register new listener global event
bus, client disconnects, cleanup function only removes WebSocket
connection pool — doesn't unregister event listener. Over time, orphaned
listeners accumulate each one holds reference connection's closure, holds
entire request context. fix straightforward: store listener reference
connection time explicitly remove disconnect handler.

An LLM can fully understand and act on the compressed version.
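The "60% kept" figure follows from the word counts above:

```python
original_words = 98
compressed_words = 59

# Keep ratio: fraction of original words that survive compression.
keep_ratio = compressed_words / original_words
```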

Usage

```python
from kompress.inference.pytorch_runner import KompressRunner

# Auto-downloads from HuggingFace on first use
runner = KompressRunner()

result = runner.compress("Your long text here...")
print(result.compressed)        # Compressed text
print(result.compression_ratio) # e.g., 0.62
print(result.tokens_saved)      # Number of tokens saved
```

With Headroom (LLM Proxy)

```shell
pip install headroom-ai
```

Kompress is built into Headroom as the default text compressor. It auto-downloads and runs on every API request that passes through the proxy.

Training

Architecture

  • Base: answerdotai/ModernBERT-base (149M params, 8192 token context)
  • Token head: Linear(768, 2) — binary keep/discard classifier
  • Span head: Conv1d(768→256, k=5) → GELU → Conv1d(256→1, k=3) → Sigmoid
  • Total: 150M params
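As a shape check, the span head maps a `(seq_len, hidden_dim)` sequence of hidden states to one score per token. The stdlib sketch below mirrors the Conv1d → GELU → Conv1d → Sigmoid pipeline with toy channel sizes and random weights standing in for the learned ones (the real head is a trained PyTorch module):

```python
import math
import random

def conv1d(xs, in_ch, out_ch, k, weights):
    """'Same'-padded 1D convolution over a list of in_ch-dim vectors."""
    pad = k // 2
    padded = [[0.0] * in_ch] * pad + xs + [[0.0] * in_ch] * pad
    out = []
    for t in range(len(xs)):
        window = padded[t:t + k]
        out.append([
            sum(weights[o][j][c] * window[j][c]
                for j in range(k) for c in range(in_ch))
            for o in range(out_ch)
        ])
    return out

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_head(hidden, d=8, mid=4):
    """Conv1d(d->mid, k=5) -> GELU -> Conv1d(mid->1, k=3) -> Sigmoid.
    Toy dims d and mid stand in for the real 768 and 256 channels."""
    rng = random.Random(0)
    w1 = [[[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(5)]
          for _ in range(mid)]
    w2 = [[[rng.gauss(0, 0.1) for _ in range(mid)] for _ in range(3)]]
    h = conv1d(hidden, d, mid, 5, w1)
    h = [[gelu(v) for v in vec] for vec in h]
    h = conv1d(h, mid, 1, 3, w2)
    return [sigmoid(vec[0]) for vec in h]  # one score per token
```

With "same" padding, each convolution preserves sequence length, so the head emits exactly one importance score per input token.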

Data

215K extractive compression labels from 8 diverse datasets, labeled by Claude Sonnet 4.6:

| Dataset | Count | Type |
|---|---|---|
| LMSYS-Chat-1M | 57K | LLM conversations |
| CNN/DailyMail | 50K | News articles |
| WikiHow | 50K | How-to guides |
| MeetingBank | 50K | Meeting transcripts |
| XSum | 47K | News articles |
| GovReport | 25K | Government reports |
| ArXiv | 25K | Academic papers |
| SAMSum | 14K | Dialogues |

Labeling Approach

Key insight: the labels must be strictly extractive — a subset of original words in original order. Previous versions failed because the labeling LLM rephrased text, causing alignment failures (5-12% keep ratio instead of the intended 40-60%).

The fix: prompt Claude to "select words like highlighting with a marker" rather than "compress this text." This ensures every word in the compressed output exists in the original, and the greedy alignment recovers 95%+ of the intended labels.
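Assuming the greedy alignment is a single left-to-right subsequence match, it might look like this (a hypothetical helper; the project's actual alignment code may differ):

```python
def align_labels(original_tokens, highlighted_tokens):
    """Greedily align the labeling LLM's highlighted words against the
    original token sequence, producing a keep (1) / discard (0) label
    per original token.

    Because labels are strictly extractive (a subset of original words
    in original order), one left-to-right pass suffices; a highlighted
    word that cannot be matched in order is simply dropped.
    """
    labels = [0] * len(original_tokens)
    i = 0  # cursor into the original sequence
    matched = 0
    for word in highlighted_tokens:
        while i < len(original_tokens) and original_tokens[i] != word:
            i += 1
        if i < len(original_tokens):
            labels[i] = 1
            matched += 1
            i += 1
    recovery = matched / max(1, len(highlighted_tokens))
    return labels, recovery
```

A rephrased word never matches and silently lowers the recovery rate, which is why rephrasing labelers collapsed to a 5-12% keep ratio while strictly extractive labels recover 95%+.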

Training Details

  • 3 epochs, batch size 32, learning rate 2e-5
  • BF16 mixed precision on NVIDIA H100
  • HuggingFace Trainer with warmup + cosine schedule
  • ~3 hours training time

Model Family

| | kompress-base | kompress-small | LLMLingua-2 |
|---|---|---|---|
| Architecture | ModernBERT 22-layer | ModernBERT 6-layer (distilled) | mBERT (2018) |
| Params | 150M | 70M | 179M |
| Size | 600MB | 279MB (ONNX: 275MB) | 710MB |
| Max context | 8,192 tokens | 8,192 tokens | 512 tokens |
| Quality | 7.9/10 | 7.4/10 | 5.9/10 |
| Latency | 84ms (MPS) | 13-29ms (ONNX) | 117ms |
| Training data | 215K from 8 datasets | Distilled from base | 41K from MeetingBank |
| Labeling model | Claude Sonnet 4.6 | — | GPT-4 |
| Compression | Content-adaptive | Content-adaptive | Fixed ratio |

Limitations

  • English only (ModernBERT is English-focused)
  • Best on natural language text; structured data (JSON, code, logs) should use specialized compressors
  • Compression ratio varies by content (60-80% kept for dense text, 40-60% for verbose text)

License

Apache 2.0
