8Fai/context-filter

Context-Filter

Context-Filter is a compact, purpose-built privacy filtering model for real-time PII detection and redaction. At roughly 61M parameters it runs comfortably on CPU or any consumer GPU, and it supports sequences up to 32,768 tokens via Sliding Window Attention. It ships with a built-in regex hybrid layer that targets near-zero false negatives on structured formats such as emails, IP addresses, and social security numbers.


Highlights

  • Custom Architecture — Not a Fine-Tune: Context-Filter is trained from scratch using a purpose-designed encoder: Grouped Query Attention (8Q / 4KV heads), RMSNorm, RoPE with θ = 500,000, and SwiGLU FFNs. No base model weights are reused.

  • 32K Context via Sliding Window Attention: Each token attends to a local window of ±512 tokens. Memory scales as O(n · w) rather than O(n²), making long-document redaction practical on commodity hardware.

  • 12 PII Entity Classes: Covers personal identity, financial, network, and government-issued identifiers with a single BIO tagging head.

  • Focal Loss Training: Trained with focal loss (γ = 2.0) to suppress the dominant O-label class and sharpen precision on rare entity spans.

  • Dual Output Modes: Returns either semantic labels (private_email) or bracketed redaction tags ([EMAIL]), selectable per call.

  • Per-Entity Confidence Scores: Every detected span carries a softmax confidence value, enabling downstream threshold filtering.

  • Regex Hybrid Layer: A built-in post-processing pass applies deterministic regex patterns for structured PII formats, so identifiers that match those patterns are caught regardless of model uncertainty.
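
The focal-loss highlight can be made concrete with a small sketch. It assumes the standard formulation FL(p) = -(1 - p)^γ · log(p) with no class-weighting term; the card only states γ = 2.0, so the function names and the example probabilities below are illustrative.

```python
import math

def cross_entropy(p_true: float) -> float:
    """Plain cross-entropy for one token, given the probability of its true label."""
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one token; gamma = 2.0 is the value quoted in the card."""
    # The (1 - p)^gamma factor shrinks the loss of confidently correct tokens,
    # which in a BIO tagger are mostly the abundant "O" tokens.
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)

# An easy "O" token (p = 0.95) versus a hard entity token (p = 0.30).
for p in (0.95, 0.30):
    print(f"p={p:.2f}  CE={cross_entropy(p):.4f}  focal={focal_loss(p):.4f}")
```

With γ = 2 the easy token's loss is scaled by (0.05)² = 0.0025, while the hard token's loss is scaled by (0.70)² = 0.49, so the gradient signal is dominated by the rare, uncertain spans.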


Model Overview

| Property | Value |
|---|---|
| Type | Token Classification (BIO NER) |
| Architecture | Custom Encoder (Context-Filter) |
| Training | From scratch, synthetic data only |
| Parameters | ~61M |
| Context Length | 32,768 tokens |
| VRAM (bfloat16) | ~152 MB |
| VRAM (int8) | ~76 MB |
| Tokenizer | GPT-2 BPE (50,257 vocabulary) |
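
The ~61M parameter figure can be sanity-checked against the dimensions listed under Architecture Specification. The sketch below is an estimate under stated assumptions, not the author's accounting: it assumes bias-free projections, two RMSNorm weight vectors per layer plus a final one, and a linear tagging head with 25 labels (B- and I- tags for the 12 classes, plus O).

```python
# Dimensions copied from the Architecture Specification table.
VOCAB, HIDDEN, LAYERS = 50_257, 512, 10
Q_HEADS, KV_HEADS, HEAD_DIM = 8, 4, 64
FFN_DIM = 1_792
NUM_LABELS = 2 * 12 + 1  # assumption: B-/I- tags for 12 classes, plus "O"

def estimate_parameters() -> dict[str, int]:
    embedding = VOCAB * HIDDEN
    # GQA: full-width query and output projections, narrower key/value projections.
    attention = (
        HIDDEN * Q_HEADS * HEAD_DIM          # W_q
        + 2 * HIDDEN * KV_HEADS * HEAD_DIM   # W_k and W_v
        + Q_HEADS * HEAD_DIM * HIDDEN        # W_o
    )
    ffn = 3 * HIDDEN * FFN_DIM               # SwiGLU: gate, up, and down matrices
    norms = 2 * HIDDEN                       # assumed: two RMSNorm weights per layer
    per_layer = attention + ffn + norms
    head = HIDDEN * NUM_LABELS + NUM_LABELS  # assumed: linear head with bias
    final_norm = HIDDEN
    total = embedding + LAYERS * per_layer + final_norm + head
    return {
        "embedding": embedding,
        "encoder_layers": LAYERS * per_layer,
        "head_and_final_norm": head + final_norm,
        "total": total,
    }

for name, count in estimate_parameters().items():
    print(f"{name:>20}: {count:>12,}")
```

Under these assumptions the total comes to about 61.1M, of which roughly 25.7M is the token embedding table and roughly 35.4M is the ten encoder layers.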

Architecture Specification

| Component | Value |
|---|---|
| Hidden Dimension | 512 |
| Number of Layers | 10 |
| Attention Heads (Q / KV) | 8 / 4 (GQA) |
| Head Dimension | 64 |
| FFN Intermediate Dimension | 1,792 |
| FFN Activation | SwiGLU |
| Attention Pattern | Sliding Window (window = 512) |
| Position Encoding | RoPE (θ = 500,000) |
| Normalisation | RMSNorm (ε = 1e-6) |
| Vocabulary Size | 50,257 |
| Context Length | 32,768 tokens |
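
To illustrate why the sliding-window pattern keeps long inputs tractable, the sketch below counts attention-score entries for a windowed layout versus full attention. It assumes a symmetric window of ±512 tokens around each position, as described in the Highlights; if 512 is instead the total window width, the saving is roughly twice as large. The helper names are hypothetical and not part of the model's code.

```python
def attended_range(i: int, seq_len: int, window: int = 512) -> range:
    """Positions token i may attend to, assuming a symmetric +/- `window` span."""
    return range(max(0, i - window), min(seq_len, i + window + 1))

def attention_entries(seq_len: int, window: int = 512) -> tuple[int, int]:
    """Return (sliding-window score entries, full-attention score entries)."""
    local = sum(len(attended_range(i, seq_len, window)) for i in range(seq_len))
    return local, seq_len * seq_len

local, full = attention_entries(32_768)
print(f"sliding window: {local:,} entries")
print(f"full attention: {full:,} entries")
print(f"ratio: {full / local:.1f}x")
```

At 32,768 tokens the windowed layout needs roughly 32 times fewer score entries than full attention under this assumption, which is the O(n · w) versus O(n²) difference in practice.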

Entity Classes

| Label | Type | Examples |
|---|---|---|
| PERSON | Full names | Jane Smith, Dr. Erik Larsson |
| EMAIL | Email addresses | user@domain.com |
| PHONE | Phone numbers | +1-555-234-5678, 07700 900123 |
| ADDRESS | Postal addresses | 42 Baker Street, London |
| SSN | Social security numbers | 452-78-9012 |
| CREDITCARD | Payment card numbers | 4111-1111-1111-1111 |
| IP | IPv4 addresses | 192.168.1.104 |
| DATE | Dates of birth and event dates | 1990-07-12, March 15, 2024 |
| ORG | Organisation names | Acme Corp, St. Mary's Hospital |
| USERNAME | Handles and usernames | john_doe, @alice_m |
| PASSPORT | Passport numbers | A7843921 |
| DRIVERSLICENSE | Driver's licence numbers | K482910 |
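
For readers unfamiliar with BIO tagging, the sketch below shows how the 12 classes above expand into a token-level label set and how a tag sequence is grouped back into spans. The label ordering and the `decode_bio` helper are assumptions for illustration; the card does not publish the model's actual label map.

```python
ENTITY_CLASSES = [
    "PERSON", "EMAIL", "PHONE", "ADDRESS", "SSN", "CREDITCARD",
    "IP", "DATE", "ORG", "USERNAME", "PASSPORT", "DRIVERSLICENSE",
]
# "O" plus a B- and an I- tag per class: 1 + 2 * 12 = 25 labels.
LABELS = ["O"] + [f"{p}-{c}" for c in ENTITY_CLASSES for p in ("B", "I")]

def decode_bio(tags: list[str]) -> list[tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start_token, end_token) spans, end exclusive."""
    spans, current, start = [], None, 0
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.append((current, start, i))
            current = None
        # A stray I- tag with no open span is treated leniently as a span start.
        if prefix == "B" or (prefix == "I" and current is None):
            current, start = label, i
    return spans

tags = ["O", "O", "O", "B-PERSON", "O", "B-EMAIL", "I-EMAIL", "I-EMAIL"]
print(len(LABELS))
print(decode_bio(tags))
```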

Quickstart

Installation

pip install torch transformers

Load the Model

import torch
from context_filter_v2_train import ContextFilterInference

engine = ContextFilterInference("./context_filter_v2")

Redact Mode — [ENTITY] brackets

result = engine.filter(
    "My name is Andrew and my Gmail is Andrew@gmail.com and live in Sweden",
    mode="redact",
)

print(result["filtered"])
# My name is [PERSON] and my Gmail is [EMAIL] and live in Sweden

Label Mode — semantic placeholders

result = engine.filter(
    "My name is Andrew and my Gmail is Andrew@gmail.com and live in Sweden",
    mode="label",
)

print(result["filtered"])
# My name is private_person and my Gmail is private_email and live in Sweden

Entity Spans with Confidence

for entity in result["entities"]:
    print(entity)

# {'type': 'PERSON', 'start': 11, 'end': 17, 'text': 'Andrew', 'confidence': 0.987}
# {'type': 'EMAIL',  'start': 34, 'end': 50, 'text': 'Andrew@gmail.com', 'confidence': 0.995}

Batch Processing

texts = [
    "Call Sarah at +1-555-234-5678.",
    "Server 192.168.1.1 accessed by john_doe on 2024-03-15.",
    "Account: Michael Chen, SSN: 452-78-9012.",
]

results = engine.filter_batch(texts, mode="redact")

for r in results:
    print(r["filtered"])

# Call Sarah at [PHONE].
# Server [IP] accessed by [USERNAME] on [DATE].
# Account: [PERSON], SSN: [SSN].

Disable Regex Hybrid (model-only predictions)

text = "Server 192.168.1.1 accessed by john_doe on 2024-03-15."
result = engine.filter(text, mode="redact", regex_hybrid=False)

Output Format Reference

filter() return value

{
    "filtered": str,          # processed text with PII replaced
    "entities": [
        {
            "type":       str,    # entity class name (e.g. "EMAIL")
            "start":      int,    # character start offset in original text
            "end":        int,    # character end offset in original text
            "text":       str,    # original PII span
            "confidence": float,  # softmax confidence [0.0 – 1.0]
        },
        ...
    ]
}
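
Because each entity carries character offsets and a confidence value, a caller can rebuild the redacted string with its own threshold instead of using `filtered` directly. The following is a minimal sketch under that assumption; `apply_entities` is a hypothetical helper, and the entity dicts are hand-written examples in the format above, not real model output.

```python
def apply_entities(text: str, entities: list[dict], min_confidence: float = 0.8) -> str:
    """Rebuild a redacted string from entity spans, keeping only confident detections."""
    kept = [e for e in entities if e["confidence"] >= min_confidence]
    out, cursor = [], 0
    for e in sorted(kept, key=lambda e: e["start"]):
        if e["start"] < cursor:  # skip a span overlapping one already replaced
            continue
        out.append(text[cursor:e["start"]])
        out.append(f"[{e['type']}]")
        cursor = e["end"]
    out.append(text[cursor:])
    return "".join(out)

text = "Published 2019-05-01. Contact jane@example.com."
entities = [  # hand-written values for illustration
    {"type": "DATE", "start": 10, "end": 20, "text": "2019-05-01", "confidence": 0.62},
    {"type": "EMAIL", "start": 30, "end": 46, "text": "jane@example.com", "confidence": 0.99},
]

print(apply_entities(text, entities, min_confidence=0.8))
print(apply_entities(text, entities, min_confidence=0.5))
```

With the 0.8 threshold the low-confidence DATE span is left in place and only the email is replaced; lowering the threshold to 0.5 redacts both.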

Mode comparison

| Input | mode="label" | mode="redact" |
|---|---|---|
| Andrew@gmail.com | private_email | [EMAIL] |
| Jane Smith | private_person | [PERSON] |
| +1-555-234-5678 | private_phone | [PHONE] |
| 452-78-9012 | private_ssn | [SSN] |
| 192.168.1.104 | private_ip | [IP] |
| A7843921 | private_passport | [PASSPORT] |

Performance Characteristics

| Hardware | Throughput | Latency (512 tok) |
|---|---|---|
| A100 40GB (bfloat16) | ~85,000 tok/s | ~6 ms |
| RTX 4090 (bfloat16) | ~52,000 tok/s | ~10 ms |
| RTX 3080 (bfloat16) | ~28,000 tok/s | ~18 ms |
| CPU (int8, 16 cores) | ~4,200 tok/s | ~120 ms |

Throughput measured at batch size 32. Latency measured at batch size 1.

Memory Footprint

| Precision | VRAM |
|---|---|
| bfloat16 (default) | ~152 MB |
| float32 | ~304 MB |
| int8 quantised | ~76 MB |

Intended Use Cases

| Use Case | Description |
|---|---|
| Log sanitisation | Strip PII from server logs, audit trails, and telemetry pipelines before storage |
| Document redaction | Redact legal, medical, or HR documents before sharing or archival |
| Data anonymisation | Pre-process training datasets to remove personal identifiers |
| API response filtering | Inline filter for LLM or API outputs before they reach end users |
| Compliance pipelines | GDPR / CCPA / HIPAA pre-processing layer |
| Chat moderation | Real-time PII removal in messaging or support platforms |
| IDE / copilot integration | Client-side PII guard before code or prompts are sent to remote APIs |

Hybrid Detection Strategy

Context-Filter uses a two-layer detection approach for maximum recall:

Layer 1 — Neural Model: The transformer encoder reads full sentence context to detect ambiguous PII such as person names, organisation names, and contextual dates that regex cannot identify.

Layer 2 — Regex Safety Net: A deterministic pass using compiled regular expressions guarantees recall on structurally defined formats (email, IPv4, SSN, credit card, phone, passport, driver's licence) regardless of model confidence.

The two layers are merged with entity-level deduplication: spans already found by the model are not double-tagged. For pattern-matchable identifiers, this combination removes the main false-negative failure mode of pure-neural approaches while keeping the contextual understanding that regex-only tools cannot provide.
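
The merge step described above can be sketched as follows. This is not the shipped implementation: the three patterns are simplified stand-ins (the model's own patterns are not published in this card), the overlap rule is one plausible reading of "entity-level deduplication", and assigning confidence 1.0 to regex hits is an assumption.

```python
import re

# Illustrative patterns only; deliberately simplified.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def overlaps(a: dict, b: dict) -> bool:
    """True when two half-open character spans share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]

def merge_with_regex(text: str, model_entities: list[dict]) -> list[dict]:
    """Keep every model span and add regex matches that no existing span covers."""
    merged = list(model_entities)
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            candidate = {"type": label, "start": m.start(), "end": m.end(),
                         "text": m.group(), "confidence": 1.0}  # assumed score
            if not any(overlaps(candidate, e) for e in merged):
                merged.append(candidate)
    return sorted(merged, key=lambda e: e["start"])

text = "Mail jane@example.com from 192.168.1.104"
start = text.index("jane@example.com")
model_entities = [  # pretend the neural layer found only the email
    {"type": "EMAIL", "start": start, "end": start + len("jane@example.com"),
     "text": "jane@example.com", "confidence": 0.97},
]
for entity in merge_with_regex(text, model_entities):
    print(entity["type"], entity["text"], entity["confidence"])
```

The email keeps the model's confidence because the regex match overlaps an existing span and is dropped, while the IP address the model missed is added by the regex pass.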


Limitations

  • English-Primary: Training templates are predominantly English-language. Names and organisation names in non-Latin scripts may have reduced recall.
  • Highly Nested PII: Overlapping or recursively nested PII spans (e.g., an email containing a person's name as the local part) are resolved to the outermost detected entity.
  • Synthetic Training Data: The model was trained entirely on procedurally generated examples. Domain-specific PII formats not covered by the synthetic generator (e.g., jurisdiction-specific ID numbers) may have lower recall until fine-tuned on real-world samples.
  • Contextual Dates: Generic dates (e.g., publication dates, historical dates) may occasionally be tagged as DATE. Post-filter confidence thresholding (e.g., confidence > 0.8) can reduce these false positives.
  • No Document Structure Awareness: The model operates on raw token sequences without awareness of HTML, Markdown, or JSON structure. Strip formatting before passing structured documents.
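
For JSON inputs, one way to follow the last recommendation is to parse the document first and filter only its string values, so keys and structure never reach the model. The sketch below assumes that approach; `filter_json_strings` is a hypothetical helper, and `fake_filter` stands in for a call such as `engine.filter(s, mode="redact")["filtered"]` so the example runs without the model.

```python
import json
from typing import Any, Callable

def filter_json_strings(node: Any, filter_text: Callable[[str], str]) -> Any:
    """Apply `filter_text` to every string value, leaving keys and structure untouched."""
    if isinstance(node, str):
        return filter_text(node)
    if isinstance(node, list):
        return [filter_json_strings(v, filter_text) for v in node]
    if isinstance(node, dict):
        return {k: filter_json_strings(v, filter_text) for k, v in node.items()}
    return node  # numbers, booleans, and None pass through unchanged

def fake_filter(s: str) -> str:
    """Stand-in for the real model call; replaces one known address."""
    return s.replace("jane@example.com", "[EMAIL]")

doc = json.loads('{"user": {"contact": "jane@example.com", "age": 41}, "tags": ["vip"]}')
print(json.dumps(filter_json_strings(doc, fake_filter)))
```

Filtering values one at a time removes cross-field context, so short strings such as bare names may be harder for the neural layer to classify than the same text inside a sentence.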

License

Context-Filter is released under the Apache License 2.0.


Context-Filter — purpose-built for privacy, not adapted for it.