iiiorg/piiranha-v1-detect-personal-information

Contact

william (at) integrinet [dot] org

Piiranha-v1: Protect your personal information!

Piiranha (cc-by-nc-nd-4.0 license) is trained to detect 17 types of Personally Identifiable Information (PII) across six languages. It successfully catches 98.27% of PII tokens, with an overall classification accuracy of 99.44%. Piiranha is especially accurate at detecting passwords, emails (100%), phone numbers, and usernames.

Performance on PII vs. Non PII classification task:

  • Precision: 98.48% (98.48% of tokens classified as PII are actually PII)
  • Recall: 98.27% (correctly identifies 98.27% of PII tokens)
  • Specificity: 99.84% (correctly identifies 99.84% of Non PII tokens)
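
The three binary metrics above follow directly from token-level confusion counts. A minimal sketch, assuming hypothetical counts (the function name and the numbers are illustrative and are not taken from the Piiranha evaluation):

```python
from typing import Dict


def binary_pii_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Token-level metrics for the PII vs. Non PII task.

    tp: PII tokens flagged as PII        fp: Non PII tokens flagged as PII
    tn: Non PII tokens left unflagged    fn: PII tokens that were missed
    """
    return {
        # Of all tokens flagged as PII, how many really are PII?
        "precision": tp / (tp + fp),
        # Of all real PII tokens, how many were flagged?
        "recall": tp / (tp + fn),
        # Of all real Non PII tokens, how many were left alone?
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = binary_pii_metrics(tp=90, fp=10, tn=880, fn=20)
print(metrics)
```

For redaction, recall is usually the number to watch: a missed PII token (a false negative) leaks data, while a false positive only over-redacts.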

Piiranha was trained on H100 GPUs generously sponsored by the Akash Network

Total downloads: 1.118 million and counting!

Model Description

Piiranha is a fine-tuned version of microsoft/mdeberta-v3-base. The context length is 256 DeBERTa tokens. If your text is longer than that, split it into chunks of at most 256 tokens and run the model on each chunk.
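
One way to split long input is a sliding window over the token IDs. The sketch below is an assumption about how you might do this, not the authors' method: the function name, the `overlap` parameter, and its default of 32 are illustrative. The overlap is there so that an entity cut at one window boundary still appears whole in the next window. If your tokenizer adds special tokens to each chunk, you may need a `max_len` slightly below 256 to leave room for them.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_len: int = 256, overlap: int = 32) -> List[List[int]]:
    """Split a token-ID sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (an illustrative choice).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window already reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
    return chunks


# Stand-in for real token IDs: 600 consecutive integers.
ids = list(range(600))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

When merging predictions from overlapping windows, you still have to decide which window wins for the shared tokens; a common simple choice is to keep the prediction with the higher score.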

Supported languages: English, Spanish, French, German, Italian, Dutch

Supported PII types: Account Number, Building Number, City, Credit Card Number, Date of Birth, Driver's License, Email, First Name, Last Name, ID Card, Password, Social Security Number, Street Address, Tax Number, Phone Number, Username, Zipcode.

It achieves the following results on a test set of ~73,000 sentences containing PII:

  • Accuracy: 99.44%
  • Loss: 0.0173
  • Precision: 93.16%
  • Recall: 93.08%
  • F1: 93.12%

Note that the above metrics factor in the eighteen possible categories (17 PII and 1 Non PII), so the metrics are lower than the metrics for just PII vs. Non PII (binary classification).
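
To see why the 18-class metrics are lower than the binary ones, compare exact-type recall with binary recall on the same predictions. This is a minimal sketch: the label `"O"` for the Non PII class is an assumption, the function names are illustrative, and the example labels are made up (the entity names are borrowed from the per-type table).

```python
from typing import List

NON_PII = "O"  # assumed name of the Non PII class


def is_pii(label: str) -> bool:
    return label != NON_PII


def typed_recall(true_labels: List[str], pred_labels: List[str]) -> float:
    """Share of PII tokens whose exact PII type was predicted (0.0 if no PII)."""
    pii = [(t, p) for t, p in zip(true_labels, pred_labels) if is_pii(t)]
    return sum(t == p for t, p in pii) / len(pii) if pii else 0.0


def binary_recall(true_labels: List[str], pred_labels: List[str]) -> float:
    """Share of PII tokens predicted as any PII type (0.0 if no PII)."""
    pii = [(t, p) for t, p in zip(true_labels, pred_labels) if is_pii(t)]
    return sum(is_pii(p) for _, p in pii) / len(pii) if pii else 0.0


# Made-up example: a surname is mislabeled as a given name.
true = ["GIVENNAME", "SURNAME", "O", "EMAIL"]
pred = ["GIVENNAME", "GIVENNAME", "O", "EMAIL"]

print(typed_recall(true, pred))   # 2 of 3 PII tokens have the exact type
print(binary_recall(true, pred))  # all 3 PII tokens are still flagged as PII
```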

Performance by PII type

The per-type metrics below are lower than the overall accuracy of 99.44% because of class imbalance: most tokens are not PII, so correctly labeled Non PII tokens dominate the accuracy figure. However, the model is more useful for PII detection than these per-type results suggest. It sometimes mistakes one PII type for another, but it still recognizes the token as PII. For instance, the model often confuses first names with last names, which is harmless for redaction because the name is still flagged as PII.

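| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| ACCOUNTNUM | 0.84 | 0.87 | 0.85 | 3575 |
| BUILDINGNUM | 0.92 | 0.90 | 0.91 | 3252 |
| CITY | 0.95 | 0.97 | 0.96 | 7270 |
| CREDITCARDNUMBER | 0.94 | 0.96 | 0.95 | 2308 |
| DATEOFBIRTH | 0.93 | 0.85 | 0.89 | 3389 |
| DRIVERLICENSENUM | 0.96 | 0.96 | 0.96 | 2244 |
| EMAIL | 1.00 | 1.00 | 1.00 | 6892 |
| GIVENNAME | 0.87 | 0.93 | 0.90 | 12150 |
| IDCARDNUM | 0.89 | 0.94 | 0.91 | 3700 |
| PASSWORD | 0.98 | 0.98 | 0.98 | 2387 |
| SOCIALNUM | 0.93 | 0.94 | 0.93 | 2709 |
| STREET | 0.97 | 0.95 | 0.96 | 3331 |
| SURNAME | 0.89 | 0.78 | 0.83 | 8267 |
| TAXNUM | 0.97 | 0.89 | 0.93 | 2322 |
| TELEPHONENUM | 0.99 | 1.00 | 0.99 | 5039 |
| USERNAME | 0.98 | 0.98 | 0.98 | 7680 |
| ZIPCODE | 0.94 | 0.97 | 0.95 | 3191 |
| micro avg | 0.93 | 0.93 | 0.93 | 79706 |
| macro avg | 0.94 | 0.93 | 0.93 | 79706 |
| weighted avg | 0.93 | 0.93 | 0.93 | 79706 |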
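
The macro and weighted averages in the last rows can be recomputed from the per-type rows. The sketch below does this for precision and recall; the variable and function names are illustrative, and because the inputs are the two-decimal values from the table, the results can differ from the original evaluation in the last digit.

```python
# (precision, recall, support) per entity, copied from the table above.
ROWS = {
    "ACCOUNTNUM": (0.84, 0.87, 3575),
    "BUILDINGNUM": (0.92, 0.90, 3252),
    "CITY": (0.95, 0.97, 7270),
    "CREDITCARDNUMBER": (0.94, 0.96, 2308),
    "DATEOFBIRTH": (0.93, 0.85, 3389),
    "DRIVERLICENSENUM": (0.96, 0.96, 2244),
    "EMAIL": (1.00, 1.00, 6892),
    "GIVENNAME": (0.87, 0.93, 12150),
    "IDCARDNUM": (0.89, 0.94, 3700),
    "PASSWORD": (0.98, 0.98, 2387),
    "SOCIALNUM": (0.93, 0.94, 2709),
    "STREET": (0.97, 0.95, 3331),
    "SURNAME": (0.89, 0.78, 8267),
    "TAXNUM": (0.97, 0.89, 2322),
    "TELEPHONENUM": (0.99, 1.00, 5039),
    "USERNAME": (0.98, 0.98, 7680),
    "ZIPCODE": (0.94, 0.97, 3191),
}

total_support = sum(support for _, _, support in ROWS.values())


def macro_avg(column: int) -> float:
    """Unweighted mean over entity types: every type counts equally."""
    return sum(row[column] for row in ROWS.values()) / len(ROWS)


def weighted_avg(column: int) -> float:
    """Mean weighted by support: frequent types count more."""
    return sum(row[column] * row[2] for row in ROWS.values()) / total_support


print("support:", total_support)
print("macro precision:", round(macro_avg(0), 2), "recall:", round(macro_avg(1), 2))
print("weighted precision:", round(weighted_avg(0), 2))
```

Macro precision comes out slightly higher than weighted precision because the two largest name classes, GIVENNAME and SURNAME, have below-average precision and pull the support-weighted mean down.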

Intended uses & limitations

Piiranha can be used to assist with redacting PII from texts. Use at your own risk. We do not accept responsibility for any incorrect model predictions.
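
Once you have character spans for the detected entities, redaction itself is a string operation. The sketch below assumes each entity is a dict with `entity_group`, `start`, and `end` keys, which is the shape returned by the Hugging Face token-classification pipeline when an aggregation strategy is set. The `redact` helper, the placeholder format, and the example spans are hypothetical and written by hand rather than produced by the model.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], placeholder: str = "[{label}]") -> str:
    """Replace each detected span with a placeholder naming its PII type.

    Assumes spans do not overlap. Spans are processed from right to left
    so that earlier character offsets stay valid after each replacement.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        tag = placeholder.format(label=ent["entity_group"])
        text = text[:ent["start"]] + tag + text[ent["end"]:]
    return text


text = "Contact Jane Doe at jane@example.com."
# Hand-written spans for illustration; real spans would come from the model.
entities = [
    {"entity_group": "GIVENNAME", "start": 8, "end": 12},
    {"entity_group": "SURNAME", "start": 13, "end": 16},
    {"entity_group": "EMAIL", "start": 20, "end": 36},
]

redacted = redact(text, entities)
print(redacted)  # Contact [GIVENNAME] [SURNAME] at [EMAIL].
```

Since the model can confuse PII types, a single generic placeholder such as `[REDACTED]` may be a safer default than type-specific tags when the exact type does not matter.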

Training and evaluation data

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 128
  • eval_batch_size: 128
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 5
  • mixed_precision_training: Native AMP
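
The `linear` scheduler with a warmup ratio of 0.05 raises the learning rate linearly from zero to its peak over the first 5% of training steps, then decays it linearly back to zero. A rough sketch of that shape, assuming a hypothetical `total_steps` of 10,000 (the actual total is not stated here) and an illustrative function name:

```python
def linear_warmup_lr(step: int, total_steps: int,
                     base_lr: float = 5e-05, warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical value, for illustration only
for s in (0, 250, 500, 5250, 10_000):
    print(s, linear_warmup_lr(s, TOTAL_STEPS))
```

With these assumed numbers, warmup ends at step 500, where the rate reaches the configured peak of 5e-05.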

Training results

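| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.2984 | 0.0983 | 250 | 0.1005 | 0.5446 | 0.6111 | 0.5759 | 0.9702 |
| 0.0568 | 0.1965 | 500 | 0.0464 | 0.7895 | 0.8459 | 0.8167 | 0.9849 |
| 0.0441 | 0.2948 | 750 | 0.0400 | 0.8346 | 0.8669 | 0.8504 | 0.9869 |
| 0.0368 | 0.3931 | 1000 | 0.0320 | 0.8531 | 0.8784 | 0.8656 | 0.9891 |
| 0.0323 | 0.4914 | 1250 | 0.0293 | 0.8779 | 0.8889 | 0.8834 | 0.9903 |
| 0.0287 | 0.5896 | 1500 | 0.0269 | 0.8919 | 0.8836 | 0.8877 | 0.9907 |
| 0.0282 | 0.6879 | 1750 | 0.0276 | 0.8724 | 0.9012 | 0.8866 | 0.9903 |
| 0.0268 | 0.7862 | 2000 | 0.0254 | 0.8890 | 0.9041 | 0.8965 | 0.9914 |
| 0.0264 | 0.8844 | 2250 | 0.0236 | 0.8886 | 0.9040 | 0.8962 | 0.9915 |
| 0.0243 | 0.9827 | 2500 | 0.0232 | 0.8998 | 0.9033 | 0.9015 | 0.9917 |
| 0.0213 | 1.0810 | 2750 | 0.0237 | 0.9115 | 0.9040 | 0.9077 | 0.9923 |
| 0.0213 | 1.1792 | 3000 | 0.0222 | 0.9123 | 0.9143 | 0.9133 | 0.9925 |
| 0.0217 | 1.2775 | 3250 | 0.0222 | 0.8999 | 0.9169 | 0.9083 | 0.9924 |
| 0.0209 | 1.3758 | 3500 | 0.0212 | 0.9111 | 0.9133 | 0.9122 | 0.9928 |
| 0.0204 | 1.4741 | 3750 | 0.0206 | 0.9054 | 0.9203 | 0.9128 | 0.9926 |
| 0.0183 | 1.5723 | 4000 | 0.0212 | 0.9126 | 0.9160 | 0.9143 | 0.9927 |
| 0.0191 | 1.6706 | 4250 | 0.0192 | 0.9122 | 0.9192 | 0.9157 | 0.9929 |
| 0.0185 | 1.7689 | 4500 | 0.0195 | 0.9200 | 0.9191 | 0.9196 | 0.9932 |
| 0.018 | 1.8671 | 4750 | 0.0188 | 0.9136 | 0.9215 | 0.9176 | 0.9933 |
| 0.0183 | 1.9654 | 5000 | 0.0191 | 0.9179 | 0.9212 | 0.9196 | 0.9934 |
| 0.0147 | 2.0637 | 5250 | 0.0188 | 0.9246 | 0.9242 | 0.9244 | 0.9937 |
| 0.0149 | 2.1619 | 5500 | 0.0184 | 0.9188 | 0.9254 | 0.9221 | 0.9937 |
| 0.0143 | 2.2602 | 5750 | 0.0193 | 0.9187 | 0.9224 | 0.9205 | 0.9932 |
| 0.014 | 2.3585 | 6000 | 0.0190 | 0.9246 | 0.9280 | 0.9263 | 0.9936 |
| 0.0146 | 2.4568 | 6250 | 0.0190 | 0.9225 | 0.9277 | 0.9251 | 0.9936 |
| 0.0148 | 2.5550 | 6500 | 0.0175 | 0.9297 | 0.9306 | 0.9301 | 0.9942 |
| 0.0136 | 2.6533 | 6750 | 0.0172 | 0.9191 | 0.9329 | 0.9259 | 0.9938 |
| 0.0137 | 2.7516 | 7000 | 0.0166 | 0.9299 | 0.9312 | 0.9306 | 0.9942 |
| 0.014 | 2.8498 | 7250 | 0.0167 | 0.9285 | 0.9313 | 0.9299 | 0.9942 |
| 0.0128 | 2.9481 | 7500 | 0.0166 | 0.9271 | 0.9326 | 0.9298 | 0.9943 |
| 0.0113 | 3.0464 | 7750 | 0.0171 | 0.9286 | 0.9347 | 0.9316 | 0.9946 |
| 0.0103 | 3.1447 | 8000 | 0.0172 | 0.9284 | 0.9383 | 0.9334 | 0.9945 |
| 0.0104 | 3.2429 | 8250 | 0.0169 | 0.9312 | 0.9406 | 0.9359 | 0.9947 |
| 0.0094 | 3.3412 | 8500 | 0.0166 | 0.9368 | 0.9359 | 0.9364 | 0.9948 |
| 0.01 | 3.4395 | 8750 | 0.0166 | 0.9289 | 0.9387 | 0.9337 | 0.9944 |
| 0.0099 | 3.5377 | 9000 | 0.0162 | 0.9335 | 0.9332 | 0.9334 | 0.9947 |
| 0.0099 | 3.6360 | 9250 | 0.0160 | 0.9321 | 0.9380 | 0.9350 | 0.9947 |
| 0.01 | 3.7343 | 9500 | 0.0168 | 0.9306 | 0.9389 | 0.9347 | 0.9947 |
| 0.0101 | 3.8325 | 9750 | 0.0159 | 0.9339 | 0.9350 | 0.9344 | 0.9947 |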

Framework versions

  • Transformers 4.44.2
  • PyTorch 2.4.1+cu121
  • Datasets 3.0.0
  • Tokenizers 0.19.1