Presidio On Hugging Face Makes PII Detection And Anonymization Simple

Jun 11, 2025 By Tessa Rodriguez

Working with data means being responsible for it—especially when that data contains sensitive information, such as names, phone numbers, or addresses. If you're handling any dataset that includes personal information, even unintentionally, there's a good chance you need a way to identify and manage that exposure. That's where tools like Presidio step in. Built by Microsoft, Presidio focuses on detecting and anonymizing personally identifiable information (PII) using both rule-based and machine-learning techniques. The best part? It's now on Hugging Face.

Let's take a walk through what it's like to experiment with automatic PII detection using Presidio on the Hugging Face Hub—from the basics to setting it up and running it on your data.

What is Presidio and Why It’s Useful on Hugging Face

Presidio works as a modular open-source framework that can identify a wide range of PII entities in text. Whether it's names, credit card numbers, IP addresses, or medical terms, Presidio can detect them using a mix of pattern recognition, regular expressions, and language models.

With Presidio now integrated with the Hugging Face ecosystem, it becomes easier to connect with existing pipelines, transformers, and datasets. You no longer need to install heavy services or write all your detection logic from scratch. Instead, you can experiment right in a familiar environment, test the results, and scale when you’re ready.

Being on Hugging Face also means community support, model sharing, and reproducibility. You can fork someone else’s detection setup, tweak it, and test it against your data with just a few lines of code.

Setting It Up: Presidio on the Hub

To get started, you'll need a working environment with Python and a Hugging Face token if you're working with private models or datasets.

Here’s a simplified setup:

bash

pip install presidio-analyzer presidio-anonymizer transformers
python -m spacy download en_core_web_lg

You’ll also want to load a model from Hugging Face. For example, let’s say you’re using a spaCy-based recognizer or a transformer model trained to detect custom PII types. Hugging Face provides pretrained models or allows you to bring your own.

python

from presidio_analyzer import AnalyzerEngine

# Build an analyzer with the default recognizers and spaCy NLP engine
analyzer = AnalyzerEngine()

text = "My name is Rachel Green and my email is [email protected]."

# Restrict detection to the two entity types of interest
results = analyzer.analyze(text=text, entities=["EMAIL_ADDRESS", "PERSON"], language="en")

for result in results:
    print(result)

This setup detects basic entities. If you need something more tailored, Presidio lets you plug in your own models or regular expressions.

Running PII Detection with Hugging Face Models

Let’s break down how you can enhance Presidio by combining it with models hosted on the Hub.

1. Use Transformer-based Recognizers

If the default recognizers aren’t accurate enough, you can add a transformer-based one. Let’s say you trained a model to detect job titles, and it’s hosted at your-username/pii-job-titles. You can plug it into Presidio as a custom recognizer.

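python

from presidio_analyzer import EntityRecognizer, RecognizerResult
from transformers import pipeline

# Load the NER model from the Hub; "simple" merges word pieces into whole entities
custom_pipeline = pipeline("ner", model="your-username/pii-job-titles", aggregation_strategy="simple")

Setting aggregation_strategy to a value such as "simple" or "first" lets the model group word pieces into full entities. This matters when your data includes names or multi-word terms, like "Chief Technology Officer", that might otherwise be split.

You'll then wrap this in a recognizer class. The model returns the start and end positions of detected spans, which Presidio uses to anonymize or extract the entity. Presidio expects each detection as a RecognizerResult rather than a plain dictionary, so the class subclasses EntityRecognizer and converts the pipeline's output.

python

class TransformerPIIRecognizer(EntityRecognizer):
    """Wraps the Hugging Face NER pipeline as a Presidio recognizer."""

    def load(self):
        # The pipeline is already created above, so there is nothing to load
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        for ent in custom_pipeline(text):
            if ent["entity_group"] in entities:
                # Convert each pipeline hit into the object Presidio expects
                results.append(RecognizerResult(
                    entity_type=ent["entity_group"],
                    start=ent["start"],
                    end=ent["end"],
                    score=float(ent["score"]),
                ))
        return results

To use it, create an instance with supported_entities set to your model's label names and register it with analyzer.registry.add_recognizer(...), so the analyzer runs it alongside the built-in recognizers.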

You can also filter detections by confidence score. For instance, ignore anything below 0.85 to reduce false positives. This way, your model focuses only on the most certain outputs, which is especially helpful in production settings where over-detection can lead to unnecessary redaction.
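The confidence filter described above can be sketched in a few lines. This is a minimal illustration, assuming detections are dictionaries shaped like the output of a Hugging Face NER pipeline with aggregation enabled (entity_group, score, start, end). The filter_by_score helper, the JOB_TITLE label, and the sample values are hypothetical.

```python
def filter_by_score(detections, threshold=0.85):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detections, shaped like aggregated NER pipeline output
sample = [
    {"entity_group": "JOB_TITLE", "score": 0.97, "start": 10, "end": 34},
    {"entity_group": "JOB_TITLE", "score": 0.41, "start": 50, "end": 57},
]

kept = filter_by_score(sample)
print(kept)
```

A higher threshold trades recall for precision, so it is worth checking what gets dropped on a labeled sample before fixing the value.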

2. Add Custom Entities

Let’s say you work in insurance or healthcare and need to detect medical terms, policy IDs, or treatment names. These aren’t typically recognized out of the box, but you can train and deploy a model to Hugging Face and point Presidio to it, much like the previous example.

You can even blend rule-based detection with ML to lower false positives—for instance, using regex to find insurance numbers but double-checking with a model to avoid tagging random sequences.
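One way to picture this blend is a regex pass that proposes candidates and a second check that confirms them. The sketch below uses only the standard library. The policy ID format, the find_policy_ids helper, and the keyword_nearby check are all assumptions made for illustration; in practice the confirm step would call your model rather than look for a keyword.

```python
import re

# Hypothetical policy ID format: two capital letters followed by eight digits
POLICY_ID = re.compile(r"\b[A-Z]{2}\d{8}\b")


def find_policy_ids(text, confirm):
    """Return (start, end, value) for regex hits that `confirm` accepts."""
    hits = []
    for match in POLICY_ID.finditer(text):
        # `confirm` stands in for a model call that judges the match in context
        if confirm(text, match.start(), match.end()):
            hits.append((match.start(), match.end(), match.group()))
    return hits


def keyword_nearby(text, start, end, window=30):
    """Toy stand-in for a model: accept a hit only if 'policy' precedes it."""
    return "policy" in text[max(0, start - window):start].lower()


text = "Policy number AB12345678 was renewed. Tracking code ZX87654321 shipped."
print(find_policy_ids(text, keyword_nearby))
```

Both strings match the pattern, but only the one introduced by "Policy number" passes the second check, which is the kind of false positive the combined approach is meant to remove.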

Testing and Tuning Presidio on Your Data

Once it's set up, you'll want to run Presidio against your real or test dataset. Keep in mind that Presidio returns results as a list of detected entities, along with their corresponding confidence scores. You can visualize, log, or even compare the results with human annotations if you're working on improving a model.
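Comparing detections with human annotations usually comes down to precision and recall. The sketch below is one simple way to compute them, assuming both sides are lists of (entity_type, start, end) tuples and that only exact matches count. The span_scores helper and the sample spans are hypothetical.

```python
def span_scores(predicted, annotated):
    """Exact-match precision and recall over (entity_type, start, end) tuples."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical spans for one document
predicted = [("PERSON", 0, 8), ("LOCATION", 18, 32), ("PERSON", 40, 44)]
annotated = [("PERSON", 0, 8), ("LOCATION", 18, 32), ("PHONE_NUMBER", 50, 58)]

precision, recall = span_scores(predicted, annotated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict. If partially overlapping spans should count, the comparison would need an overlap rule instead of set intersection.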

Here’s how you might apply it to a data frame:

python

import pandas as pd

df = pd.DataFrame({
    "text": [
        "John Doe lives at 123 Elm Street.",
        "Email: [email protected]. Phone: 555-1234",
    ]
})

def detect_pii(text):
    # Reuses the analyzer created earlier; returns (type, start, end) per hit
    results = analyzer.analyze(text=text, language="en")
    return [(res.entity_type, res.start, res.end) for res in results]

df["pii_detected"] = df["text"].apply(detect_pii)

If you need to anonymize the detected entities, use the AnonymizerEngine from the presidio-anonymizer package:

python

from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

# text and results come from the analyzer example earlier
response = anonymizer.anonymize(text=text, analyzer_results=results)

print(response.text)

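By default, each detected span is replaced with a placeholder such as <PERSON>, so the text can be stored or shared with far less risk of exposing personal information. No detector catches everything, so spot-check a sample of the output before relying on it.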
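To see how start and end offsets drive the replacement, here is a standard-library sketch of the idea. It is not Presidio's implementation. The replace_spans helper is hypothetical and assumes the spans do not overlap, whereas the real anonymizer also resolves overlaps and supports other operators.

```python
def replace_spans(text, spans):
    """Replace each (entity_type, start, end) span with an <ENTITY_TYPE> tag."""
    # Work from the end of the string so earlier offsets stay valid
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


sample = "John Doe lives at 123 Elm Street."
spans = [("PERSON", 0, 8), ("LOCATION", 18, 32)]
print(replace_spans(sample, spans))
```

Processing spans from right to left matters because each substitution changes the string length, which would otherwise shift the offsets of any span that follows it.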

Conclusion

Presidio's integration with the Hugging Face Hub provides a flexible and extensible way to handle sensitive data in text. Whether you're working on compliance, data labeling, or just want to keep test data clean, automatic PII detection can save a lot of time and reduce risk.

You can keep things simple with built-in recognizers or expand the detection pipeline with transformer models hosted right on the Hub. It’s up to you how far you go, but with the foundation in place, experimenting becomes a lot more accessible—and more productive.
