latence/privacy-filter-GLiNER-v0.1
Model description
A GLiNERized privacy filter built from the OpenAI Privacy Filter model. It reuses that model's encoder backbone and replaces the linear token classifier with a GLiNER head, moving from fixed classification over 8 predefined PII labels to zero-shot entity recognition. This makes the model more versatile in production settings where the set of labels to filter can change at runtime.
Background GLiNER
GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which are flexible but too costly and large for resource-constrained scenarios.
Background Privacy-filter
OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a model that they can run on-premises that is fast, context-aware, and tunable.
OpenAI Privacy Filter is pretrained autoregressively to arrive at a checkpoint with an architecture similar to gpt-oss, albeit smaller. That checkpoint was then converted into a bidirectional token classifier over a privacy label taxonomy and post-trained with a supervised classification loss. (For architecture details about gpt-oss, please see the gpt-oss model card.) Instead of generating text token by token, the model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure. For each input token, the model predicts a probability distribution over the label taxonomy, which consists of 8 output categories.
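To illustrate what a constrained Viterbi decoding step can look like, here is a minimal sketch in plain Python. It is not the Privacy Filter's actual decoder: the BIO tagging scheme, the `EMAIL` label, the function name `constrained_viterbi`, and the use of hard constraints with no learned transition scores are all assumptions made for this example.

```python
import math


def constrained_viterbi(log_probs, tags):
    """Return the best tag sequence that obeys BIO constraints.

    log_probs[t][i] is the log-probability of tags[i] at token t.
    Assumed constraint: "I-X" may only follow "B-X" or "I-X",
    and a sequence may not start with an "I-" tag.
    """
    if not log_probs:
        return []

    def allowed(prev, cur):
        if cur.startswith("I-"):
            return prev in ("B-" + cur[2:], "I-" + cur[2:])
        return True

    n = len(tags)
    # Forbid starting inside an entity.
    score = [
        -math.inf if tags[i].startswith("I-") else log_probs[0][i]
        for i in range(n)
    ]
    backpointers = []
    for t in range(1, len(log_probs)):
        new_score, pointers = [], []
        for j in range(n):
            best_i, best = 0, -math.inf
            for i in range(n):
                if allowed(tags[i], tags[j]) and score[i] > best:
                    best_i, best = i, score[i]
            new_score.append(best + log_probs[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final token.
    j = max(range(n), key=lambda i: score[i])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [tags[i] for i in path]


# Hypothetical per-token probabilities for three tokens.
tags = ["O", "B-EMAIL", "I-EMAIL"]
probs = [[0.6, 0.3, 0.1], [0.1, 0.1, 0.8], [0.8, 0.1, 0.1]]
log_probs = [[math.log(p) for p in row] for row in probs]
print(constrained_viterbi(log_probs, tags))
# → ['B-EMAIL', 'I-EMAIL', 'O']
```

A greedy per-token argmax on the same numbers would give `O, I-EMAIL, O`, which is not a valid span. The constraint forces the decoder to open the entity with `B-EMAIL` instead.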
Installation & Usage
Install the gliner package:
pip install gliner
Once you've installed the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.
from gliner import GLiNER
model = GLiNER.from_pretrained("latence/privacy-filter-GLiNER-v0.1")
text = """
. | Timestamp ( UTC ) | Attributed Sub-Publisher | Click IP Address | Install IP Address | IP Owner / Type | Geolocation Claimed | | - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - -
| | 2023-10-26 14 : 22 : 01 | subpub_xyz789 | 35 . 172 . 110 . 45 | ` ae1d : dbf7 : 0adf : 7adc : 9fa4 : 8abe : fd32 : d0ff ` | Amazon AWS | United States | | 2023-10-26 14 : 22 : 34 | subpub_xyz789 | 104 . 198 . 15 . 201 | 104 . 198 . 15 . 201 | Google Cloud | United States | | 2023-10-26 14 : 23 : 11 | subpub_abc123 | 159 . 65 . 92 . 203 | 159 . 65 . 92 . 203 | DigitalOcean | United States | | 2023-10-26 14 : 24 : 05 | subpub_xyz789 | 52 . 54 . 10 . 88 | 52 . 54 . 10 . 88 | Amazon AWS | United States | * * Conclusion : * * The use of datacenter IPs like ` ae1d : dbf7 : 0adf : 7adc : 9fa4 : 8abe : fd32 : d0ff ` ,
which resolves to a known server farm , is a clear violation and proof of NHT . Legitimate users do not install and play mobile games from AWS or Google Cloud servers . The fraudster ' s attempt to spoof the ` France ` is noted but ultimately negated by the IP ownership analysis . # # # # * * 4 . 3 Inconsistent and Fraudulent Device & User Agent Data * * Further analysis of the device-level data for the fraudulent cohort reveals widespread signs of emulation and parameter spoofing .
Genuine devices provide a consistent and logical set of data points . Fraudulent installs often show illogical combinations or repeated identifiers . Please review the following sample records from the fraudulent install list : | Advertising ID ( GAID ) | Device Username | Device User Agent | Notes / Red Flags | | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - | | `
AC01601C-7C7B-4C30-B84A-D9F938D8DB38 ` | ` laine . achille ` | Dalvik / 2 . 1 . 0 ( Linux ; U ; Android 9 ; SM-G960F Build / PPR1 . 180610 . 011 ) | User Agent is for a Samsung S9 on Android 9 . This GAID has been seen with 15 different User Agents in the last 24 hours . Classic sign of spoofing . | | 2b8c541f-3a21-4f9e-8e3b-1d7c0a9b4f6d | generic_x86 | ` Mozilla / 5 . 0 ( Windows NT 6 . 2 ; Win64 ; x64 ) AppleWebKit / 578 . 88 ( KHTML , like Gecko ) Chrome / 77 . 3 . 13 . 7 Safari / 572 . 64 Edg / 129 . 5 . 4 . 11 ` |
This user agent string for a ' Pixel 4 ' is missing several key tokens , indicating a poorly configured emulator . The ` Mozilla / 5 . 0 ( Windows NT 6 . 2 ; Win64 ; x64 ) AppleWebKit / 578 . 88 ( KHTML , like Gecko ) Chrome / 77 . 3 . 13 . 7 Safari / 572 . 64 Edg / 129 . 5 . 4 . 11 ` string is incomplete . | | c4e5a6b7-8d9c-0f1e-2a3b-4c5d6e7f8a9b | android-user | Dalvik / 2 . 1 . 0 ( Linux ; U ; Android 11 ; sdk_gphone_x86 Build / RSR1 . 201013 . 001 ) | ' sdk_gphone_x86 ' is the default device name for the standard Android Studio emulator . This is clearly not a real user device . | | e5f6a7b8-9c0d-1e2f-3a4b-5c6d7e8f9a0b | ` laine . achille ` | Dalvik / 1 . 6 . 0 ( Linux ; U ; Android 4 . 4 . 2 ; GT-I9505 Build / KOT49H ) | This represents an ancient Android 4 . 4 . 2 device . Our app ' s minimum requirement is Android 6 . 0 . The install should not have been possible"""
labels = ['username', 'google_gaid', 'ip_address', 'user_agent', 'country']
# For a single input string, predict_entities returns one list of entity dicts.
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(entity["text"], "=>", entity["label"], "=>", entity["score"])

Example output:
35 . 172 . 110 . 45 => ip_address => 0.458984375
United States => country => 0.404296875
United States => country => 0.33203125
subpub_abc123 => google_gaid => 0.353515625
159 . 65 . 92 . 203 => ip_address => 0.3984375
159 . 65 . 92 . 203 => ip_address => 0.322265625
52 . 54 . 10 . 88 => ip_address => 0.37890625
52 . 54 . 10 . 88 => ip_address => 0.33984375
France => country => 0.75390625
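The scores in this output vary a lot by label, and at least one low-scoring hit (`subpub_abc123` tagged as `google_gaid`) comes from the sub-publisher column rather than from an advertising ID. One way to handle this is to apply a stricter cutoff per label after prediction. The sketch below is plain Python over the entity dicts. The helper name `filter_by_label_threshold` and the threshold values are hypothetical and should be tuned on your own data.

```python
def filter_by_label_threshold(entities, thresholds, default=0.3):
    """Keep entities whose score meets the cutoff configured for their label.

    thresholds maps label -> minimum score; labels not listed use `default`.
    """
    return [
        e for e in entities
        if e["score"] >= thresholds.get(e["label"], default)
    ]


# Three predictions copied from the example output above.
sample_entities = [
    {"text": "35 . 172 . 110 . 45", "label": "ip_address", "score": 0.458984375},
    {"text": "subpub_abc123", "label": "google_gaid", "score": 0.353515625},
    {"text": "France", "label": "country", "score": 0.75390625},
]

# Illustrative cutoffs only, not recommended values.
thresholds = {"google_gaid": 0.5, "ip_address": 0.4}

kept = filter_by_label_threshold(sample_entities, thresholds)
for e in kept:
    print(e["text"], "=>", e["label"])
# → 35 . 172 . 110 . 45 => ip_address
# → France => country
```

Running the model once with a low global threshold and filtering afterwards lets you change per-label cutoffs without repeating inference.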
Training
Training started with a small "GLiNERization" warmup on a general multilingual NER dataset, followed by fine-tuning on a curated PII dataset covering English, German, and French. Besides a variety of long-tail PII labels, the dataset focuses on 78 GDPR-relevant labels.
Input:
Input Type(s): Text
Input Format: UTF-8 string(s)
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: supports structured and unstructured text
Output:
Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: List of dictionaries with keys {text, label, start, end, score}
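Because each output dictionary carries `start` and `end` offsets, the predictions can be used directly to mask text. The following is a minimal sketch, assuming that the offsets are character positions with an exclusive `end` (so that `text[start:end] == entity["text"]`) and that the spans do not overlap. The helper `mask_entities`, the placeholder format, and the sample sentence are hypothetical.

```python
def mask_entities(text, entities, template="[{label}]"):
    """Replace each predicted span with a placeholder built from its label.

    Assumes non-overlapping spans with character offsets and exclusive `end`.
    """
    masked = text
    # Process spans from right to left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if text[ent["start"]:ent["end"]] != ent["text"]:
            raise ValueError(f"Offsets do not match text for {ent['text']!r}")
        placeholder = template.format(label=ent["label"].upper())
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


def make_entity(text, value, label):
    """Build an entity dict shaped like the model output (score is a dummy)."""
    start = text.index(value)
    return {"text": value, "label": label, "start": start,
            "end": start + len(value), "score": 1.0}


sample = "User laine.achille connected from France."
sample_entities = [
    make_entity(sample, "laine.achille", "username"),
    make_entity(sample, "France", "country"),
]
print(mask_entities(sample, sample_entities))
# → User [USERNAME] connected from [COUNTRY].
```

The offset check raises early if the assumption about `start` and `end` does not hold for your version of the library, instead of silently masking the wrong characters.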
Software Integration:
Runtime Engine(s):
- PyTorch, GLiNER Python library
Limitations
This is an early checkpoint. Because the encoder backbone was already fine-tuned and its architecture differs, the model behaves differently from general pretrained DeBERTa, mT5, or ModernBERT backbones used with a GLiNER head. Benchmarks will follow with later checkpoints.