Working with data means being responsible for it—especially when that data contains sensitive information, such as names, phone numbers, or addresses. If you're handling any dataset that includes personal information, even unintentionally, there's a good chance you need a way to identify and manage that exposure. That's where tools like Presidio step in. Built by Microsoft, Presidio focuses on detecting and anonymizing personally identifiable information (PII) using both rule-based and machine-learning techniques. The best part? It's now on Hugging Face.
Let's take a walk through what it's like to experiment with automatic PII detection using Presidio on the Hugging Face Hub—from the basics to setting it up and running it on your data.
Presidio works as a modular open-source framework that can identify a wide range of PII entities in text. Whether it's names, credit card numbers, IP addresses, or medical terms, Presidio can detect them using a mix of pattern recognition, regular expressions, and language models.
With Presidio now integrated with the Hugging Face ecosystem, it becomes easier to connect with existing pipelines, transformers, and datasets. You no longer need to install heavy services or write all your detection logic from scratch. Instead, you can experiment right in a familiar environment, test the results, and scale when you’re ready.
Being on Hugging Face also means community support, model sharing, and reproducibility. You can fork someone else’s detection setup, tweak it, and test it against your data with just a few lines of code.
To get started, you'll need a working environment with Python and a Hugging Face token if you're working with private models or datasets.
Here’s a simplified setup:
```bash
pip install presidio-analyzer presidio-anonymizer
# Presidio's default NLP engine is built on spaCy, which needs a model:
python -m spacy download en_core_web_lg
```
You’ll also want to load a model from Hugging Face. For example, let’s say you’re using a spaCy-based recognizer or a transformer model trained to detect custom PII types. Hugging Face provides pretrained models or allows you to bring your own.
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text = "My name is Rachel Green and my email is [email protected]."
results = analyzer.analyze(text=text, entities=["EMAIL_ADDRESS", "PERSON"], language="en")

# Each result carries the entity type, character span, and confidence score
for result in results:
    print(result)
```
This setup detects basic entities. If you need something more tailored, Presidio allows you to plug in your models or regular expressions.
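For example, here's a minimal sketch of a regex-based recognizer. The EMPLOYEE_ID entity and its pattern are invented for illustration; they are not part of Presidio's defaults:

```python
from presidio_analyzer import Pattern, PatternRecognizer

# Hypothetical format: "EMP-" followed by six digits
employee_id_pattern = Pattern(name="employee_id", regex=r"EMP-\d{6}", score=0.6)
employee_id_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[employee_id_pattern],
)

# Register it so analyze() picks it up alongside the built-in recognizers
analyzer.registry.add_recognizer(employee_id_recognizer)
print(analyzer.analyze(text="Badge EMP-004521 was deactivated.", language="en"))
```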
Let’s break down how you can enhance Presidio by combining it with models hosted on the Hub.
If the default recognizers aren’t accurate enough, you can add a transformer-based one. Let’s say you trained a model to detect job titles, and it’s hosted at your-username/pii-job-titles. You can plug it into Presidio as a custom recognizer.
```python
from transformers import pipeline

# "your-username/pii-job-titles" is a placeholder; point this at your own model
custom_pipeline = pipeline(
    "ner",
    model="your-username/pii-job-titles",
    aggregation_strategy="simple",
)
```
Setting aggregation_strategy to "simple" or "first" makes the pipeline group word pieces back into full entities. This matters when your data includes names or multi-word terms, like "Chief Technology Officer", that would otherwise come back as fragments.
You'll then wrap this in a recognizer class. Subclassing Presidio's EntityRecognizer lets the analyzer call the model directly; the model returns the start and end positions of detected spans, which Presidio uses to anonymize or extract the entity.
```python
from presidio_analyzer import EntityRecognizer, RecognizerResult

class TransformerPIIRecognizer(EntityRecognizer):
    def load(self):
        # Nothing to do here: the transformers pipeline above is already loaded
        pass

    def analyze(self, text, entities, nlp_artifacts=None):
        results = []
        for ent in custom_pipeline(text):
            if ent["entity_group"] in entities:
                results.append(RecognizerResult(
                    entity_type=ent["entity_group"],
                    start=ent["start"],
                    end=ent["end"],
                    score=float(ent["score"]),
                ))
        return results
```
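To wire it in, register an instance with the analyzer. The JOB_TITLE entity name here is an assumption matching the hypothetical model above:

```python
recognizer = TransformerPIIRecognizer(supported_entities=["JOB_TITLE"])
analyzer.registry.add_recognizer(recognizer)

results = analyzer.analyze(
    text="Monica was promoted to Chief Technology Officer.",
    entities=["JOB_TITLE"],
    language="en",
)
```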
You can also filter detections by confidence score. For instance, ignore anything below 0.85 to reduce false positives. This way, your model focuses only on the most certain outputs, which is especially helpful in production settings where over-detection can lead to unnecessary redaction.
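A minimal sketch of that filter, with 0.85 as the cutoff suggested above rather than any library default:

```python
CONFIDENCE_THRESHOLD = 0.85  # tune per dataset; higher means fewer, safer detections

confident_results = [res for res in results if res.score >= CONFIDENCE_THRESHOLD]
```

Presidio's analyze() also accepts a score_threshold argument if you'd rather filter at detection time.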
Let’s say you work in insurance or healthcare and need to detect medical terms, policy IDs, or treatment names. These aren’t typically recognized out of the box, but you can train and deploy a model to Hugging Face and point Presidio to it, much like the previous example.
You can even blend rule-based detection with ML to lower false positives—for instance, using regex to find insurance numbers but double-checking with a model to avoid tagging random sequences.
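As a sketch of that blend, assuming an invented policy-ID format: a low-scoring regex proposes candidates, and Presidio's context-words mechanism boosts the score only when insurance-related words appear nearby.

```python
from presidio_analyzer import Pattern, PatternRecognizer

# Invented format: two uppercase letters, a dash, then eight digits
policy_pattern = Pattern(name="policy_id", regex=r"\b[A-Z]{2}-\d{8}\b", score=0.4)

policy_recognizer = PatternRecognizer(
    supported_entity="INSURANCE_POLICY",
    patterns=[policy_pattern],
    # The analyzer's context enhancer raises the score when these words are nearby
    context=["policy", "insurance", "claim"],
)
analyzer.registry.add_recognizer(policy_recognizer)
```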
Once it's set up, you'll want to run Presidio against your real or test dataset. Keep in mind that Presidio returns results as a list of detected entities, along with their corresponding confidence scores. You can visualize, log, or even compare the results with human annotations if you're working on improving a model.
Here’s how you might apply it to a pandas DataFrame:
```python
import pandas as pd

df = pd.DataFrame({
    "text": [
        "John Doe lives at 123 Elm Street.",
        "Email: [email protected]. Phone: 555-1234",
    ]
})

def detect_pii(text):
    # Run all registered recognizers and keep the span metadata
    results = analyzer.analyze(text=text, language="en")
    return [(res.entity_type, res.start, res.end) for res in results]

df["pii_detected"] = df["text"].apply(detect_pii)
```
If you need to anonymize the detected entities, use Presidio's AnonymizerEngine:
```python
from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()

# Reuses `text` and `results` from the analyzer example above
response = anonymizer.anonymize(text=text, analyzer_results=results)
print(response.text)
# By default each span is replaced with its entity type, e.g.
# "My name is <PERSON> and my email is <EMAIL_ADDRESS>."
```
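If placeholders aren't what you want, per-entity operators control the output. Here's a sketch using the built-in replace and mask operators; the replacement values are arbitrary:

```python
from presidio_anonymizer.entities import OperatorConfig

response = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={
        "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
        # Mask the first ten characters of the email address with asterisks
        "EMAIL_ADDRESS": OperatorConfig("mask", {
            "masking_char": "*",
            "chars_to_mask": 10,
            "from_end": False,
        }),
    },
)
print(response.text)
```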
Now the text can be stored or shared without exposing the detected personal information.
Presidio's integration with the Hugging Face Hub provides a flexible and extensible way to handle sensitive data in text. Whether you're working on compliance, data labeling, or just want to keep test data clean, automatic PII detection can save a lot of time and reduce risk.
You can keep things simple with built-in recognizers or expand the detection pipeline with transformer models hosted right on the Hub. It’s up to you how far you go, but with the foundation in place, experimenting becomes a lot more accessible—and more productive.