Working with data means being responsible for it—especially when that data contains sensitive information, such as names, phone numbers, or addresses. If you're handling any dataset that includes personal information, even unintentionally, there's a good chance you need a way to identify and manage that exposure. That's where tools like Presidio step in. Built by Microsoft, Presidio focuses on detecting and anonymizing personally identifiable information (PII) using both rule-based and machine-learning techniques. The best part? It's now on Hugging Face.
Let's take a walk through what it's like to experiment with automatic PII detection using Presidio on the Hugging Face Hub—from the basics to setting it up and running it on your data.
Presidio works as a modular open-source framework that can identify a wide range of PII entities in text. Whether it's names, credit card numbers, IP addresses, or medical terms, Presidio can detect them using a mix of pattern recognition, regular expressions, and language models.
With Presidio now integrated with the Hugging Face ecosystem, it becomes easier to connect with existing pipelines, transformers, and datasets. You no longer need to install heavy services or write all your detection logic from scratch. Instead, you can experiment right in a familiar environment, test the results, and scale when you’re ready.
Being on Hugging Face also means community support, model sharing, and reproducibility. You can fork someone else’s detection setup, tweak it, and test it against your data with just a few lines of code.
To get started, you'll need a working environment with Python and a Hugging Face token if you're working with private models or datasets.
Here’s a simplified setup:
```bash
pip install presidio-analyzer presidio-anonymizer
# Presidio's default NLP engine is built on spaCy, which needs a model:
python -m spacy download en_core_web_lg
```
You’ll also want to load a model from Hugging Face. For example, let’s say you’re using a spaCy-based recognizer or a transformer model trained to detect custom PII types. Hugging Face provides pretrained models or allows you to bring your own.
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = "My name is Rachel Green and my email is [email protected]."
results = analyzer.analyze(text=text, entities=["EMAIL_ADDRESS", "PERSON"], language="en")

# Each result carries the entity type, character span, and confidence score
for result in results:
    print(result)
```
This setup detects basic entities. If you need something more tailored, Presidio allows you to plug in your models or regular expressions.
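For example, here's a minimal sketch of a regex-based recognizer. The EMPLOYEE_ID entity and its pattern are invented for illustration; they are not part of Presidio's defaults:

```python
from presidio_analyzer import Pattern, PatternRecognizer

# Hypothetical format: "EMP-" followed by six digits
employee_id_pattern = Pattern(name="employee_id", regex=r"EMP-\d{6}", score=0.6)
employee_id_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[employee_id_pattern],
)

# Register it so analyze() picks it up alongside the built-in recognizers
analyzer.registry.add_recognizer(employee_id_recognizer)
print(analyzer.analyze(text="Badge EMP-004521 was deactivated.", language="en"))
```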
Let’s break down how you can enhance Presidio by combining it with models hosted on the Hub.
If the default recognizers aren’t accurate enough, you can add a transformer-based one. Let’s say you trained a model to detect job titles, and it’s hosted at your-username/pii-job-titles. You can plug it into Presidio as a custom recognizer.
```python
from transformers import pipeline

# "your-username/pii-job-titles" is a placeholder; point this at your own model
custom_pipeline = pipeline(
    "ner",
    model="your-username/pii-job-titles",
    aggregation_strategy="simple",
)
```
Setting aggregation_strategy to "simple" or "first" makes the pipeline group word pieces back into full entities. This matters when your data includes names or multi-word terms, like "Chief Technology Officer", that would otherwise come back as fragments.
You'll then wrap this in a recognizer class. Subclassing Presidio's EntityRecognizer lets the analyzer call the model directly; the model returns the start and end positions of detected spans, which Presidio uses to anonymize or extract the entity.
```python
from presidio_analyzer import EntityRecognizer, RecognizerResult

class TransformerPIIRecognizer(EntityRecognizer):
    def load(self):
        # Nothing to do here: the transformers pipeline above is already loaded
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        for ent in custom_pipeline(text):
            if ent["entity_group"] in entities:
                results.append(RecognizerResult(
                    entity_type=ent["entity_group"],
                    start=ent["start"],
                    end=ent["end"],
                    score=float(ent["score"]),
                ))
        return results
```
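To wire it in, register an instance with the analyzer. The JOB_TITLE entity name here is an assumption matching the hypothetical model above:

```python
recognizer = TransformerPIIRecognizer(supported_entities=["JOB_TITLE"])
analyzer.registry.add_recognizer(recognizer)

results = analyzer.analyze(
    text="Monica was promoted to Chief Technology Officer.",
    entities=["JOB_TITLE"],
    language="en",
)
```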
You can also filter detections by confidence score. For instance, ignore anything below 0.85 to reduce false positives. This way, your model focuses only on the most certain outputs, which is especially helpful in production settings where over-detection can lead to unnecessary redaction.
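A minimal sketch of that filter, with 0.85 as the cutoff suggested above rather than any library default:

```python
CONFIDENCE_THRESHOLD = 0.85  # tune per dataset; higher means fewer, safer detections

confident_results = [res for res in results if res.score >= CONFIDENCE_THRESHOLD]
```

Presidio's analyze() also accepts a score_threshold argument if you'd rather filter at detection time.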
Let’s say you work in insurance or healthcare and need to detect medical terms, policy IDs, or treatment names. These aren’t typically recognized out of the box, but you can train and deploy a model to Hugging Face and point Presidio to it, much like the previous example.
You can even blend rule-based detection with ML to lower false positives—for instance, using regex to find insurance numbers but double-checking with a model to avoid tagging random sequences.
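As a sketch of that blend, assuming an invented policy-ID format: a low-scoring regex proposes candidates, and Presidio's context-words mechanism boosts the score only when insurance-related words appear nearby.

```python
from presidio_analyzer import Pattern, PatternRecognizer

# Invented format: two uppercase letters, a dash, then eight digits
policy_pattern = Pattern(name="policy_id", regex=r"\b[A-Z]{2}-\d{8}\b", score=0.4)

policy_recognizer = PatternRecognizer(
    supported_entity="INSURANCE_POLICY",
    patterns=[policy_pattern],
    # The analyzer's context enhancer raises the score when these words are nearby
    context=["policy", "insurance", "claim"],
)
analyzer.registry.add_recognizer(policy_recognizer)
```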
Once it's set up, you'll want to run Presidio against your real or test dataset. Keep in mind that Presidio returns results as a list of detected entities, along with their corresponding confidence scores. You can visualize, log, or even compare the results with human annotations if you're working on improving a model.
Here’s how you might apply it to a pandas DataFrame:
```python
import pandas as pd

df = pd.DataFrame({
    "text": [
        "John Doe lives at 123 Elm Street.",
        "Email: [email protected]. Phone: 555-1234",
    ]
})

def detect_pii(text):
    # Run all registered recognizers and keep the span metadata
    results = analyzer.analyze(text=text, language="en")
    return [(res.entity_type, res.start, res.end) for res in results]

df["pii_detected"] = df["text"].apply(detect_pii)
```
If you need to anonymize the detected entities, use Presidio's AnonymizerEngine:
```python
from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

# Reuses `text` and `results` from the analyzer example above
response = anonymizer.anonymize(text=text, analyzer_results=results)
print(response.text)
# By default each span is replaced with its entity type, e.g.
# "My name is <PERSON> and my email is <EMAIL_ADDRESS>."
```
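If placeholders aren't what you want, per-entity operators control the output. Here's a sketch using the built-in replace and mask operators; the replacement values are arbitrary:

```python
from presidio_anonymizer.entities import OperatorConfig

response = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
        # Mask the first ten characters of the email address with asterisks
        "EMAIL_ADDRESS": OperatorConfig("mask", {
            "masking_char": "*",
            "chars_to_mask": 10,
            "from_end": False,
        }),
    },
)
print(response.text)
```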
Now the text can be stored or shared without exposing the detected personal information.
Presidio's integration with the Hugging Face Hub provides a flexible and extensible way to handle sensitive data in text. Whether you're working on compliance, data labeling, or just want to keep test data clean, automatic PII detection can save a lot of time and reduce risk.
You can keep things simple with built-in recognizers or expand the detection pipeline with transformer models hosted right on the Hub. It’s up to you how far you go, but with the foundation in place, experimenting becomes a lot more accessible—and more productive.