🔐 Data Privacy & Redaction¶

The datason redaction engine provides comprehensive data privacy protection for sensitive information in ML workflows, including field-level redaction, pattern-based redaction, and audit trail logging for compliance requirements.

Overview¶

Data privacy is crucial when working with sensitive information in machine learning and data science workflows. The redaction engine helps you:

Protect PII: Automatically detect and redact personally identifiable information
Ensure Compliance: Meet GDPR, HIPAA, PCI-DSS, and other regulatory requirements
Audit Trails: Maintain complete logs of redaction operations for compliance
Domain-Specific: Pre-built configurations for financial, healthcare, and general use

Quick Start¶

import datason as ds

# Simple redaction
sensitive_data = {
    "customer_email": "john.doe@example.com",
    "credit_card": "4532-1234-5678-9012",
    "password": "secret123",
    "data": [1, 2, 3, 4, 5]
}

# Create redaction engine
engine = ds.create_minimal_redaction_engine()
redacted = engine.process_object(sensitive_data)

print(redacted)
# {
#     "customer_email": "<REDACTED>",
#     "credit_card": "<REDACTED>",
#     "password": "<REDACTED>",
#     "data": [1, 2, 3, 4, 5]
# }

Redaction Engines¶

Pre-built Engines¶

datason provides three pre-configured redaction engines for common use cases:

Minimal ProtectionFinancial DataHealthcare Data

# Basic privacy protection
engine = ds.create_minimal_redaction_engine()

Protects: - Passwords, secrets, keys, tokens - Email addresses

Use Cases: - Development environments - Basic privacy requirements - Lightweight applications

# Financial industry compliance
engine = ds.create_financial_redaction_engine()

Protects: - Credit card numbers - Social Security Numbers (SSN) - Tax IDs - Account numbers - Routing numbers - CVV codes - PINs

Features: - Audit trail enabled - Redaction summary - Large object detection (5MB threshold)

Use Cases: - Banking applications - Payment processing - Financial analytics

# Healthcare compliance (HIPAA)
engine = ds.create_healthcare_redaction_engine()

Protects: - Patient IDs - Medical record numbers - Personal information (names, addresses, phone) - Dates of birth - Diagnosis information

Features: - Full audit trail - Redaction summary - Large object protection

Use Cases: - Medical research - Healthcare analytics - Patient data processing

Custom Redaction Engine¶

For specific requirements, create a custom redaction engine:

from datason import RedactionEngine

# Create custom engine
engine = RedactionEngine(
    # Field patterns to redact
    redact_fields=[
        "*.password",           # Any field named 'password'
        "*.secret",             # Any field named 'secret'  
        "user.*.email",         # Email in user objects
        "*.ssn",                # Social Security Numbers
        "config.api_key",       # Specific field path
    ],

    # Regex patterns for content redaction
    redact_patterns=[
        r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",  # Credit cards
        r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",  # Emails
        r"\b\d{3}-\d{2}-\d{4}\b",  # US SSN format
    ],

    # Large object redaction
    redact_large_objects=True,
    large_object_threshold=1024 * 1024,  # 1MB

    # Customization
    redaction_replacement="[CONFIDENTIAL]",

    # Compliance features
    include_redaction_summary=True,
    audit_trail=True,
)

Field Pattern Matching¶

Field patterns support wildcards for flexible matching:

patterns = [
    "password",           # Exact match: field named 'password'
    "*.password",         # Any field ending with 'password'
    "user.*.email",       # Email field in any user object
    "config.api.*",       # Any API-related config field
    "*.secret*",          # Any field containing 'secret'
]

Pattern-based Redaction¶

Automatically detect sensitive content using regex patterns:

# Common sensitive patterns
patterns = [
    # Credit card numbers (various formats)
    r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",

    # US Social Security Numbers
    r"\b\d{3}-\d{2}-\d{4}\b",

    # Email addresses
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b",

    # Phone numbers (US format)
    r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",

    # IPv4 addresses
    r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b",
]

engine = RedactionEngine(redact_patterns=patterns)

text = "Contact John at john.doe@company.com or call 555-123-4567"
redacted_text, was_redacted = engine.redact_text(text)
print(redacted_text)
# "Contact John at <REDACTED> or call <REDACTED>"

Large Object Protection¶

Protect against accidentally serializing large data objects:

import numpy as np

engine = RedactionEngine(
    redact_large_objects=True,
    large_object_threshold=1024 * 1024,  # 1MB threshold
)

data = {
    "large_array": np.random.random((1000, 1000)),  # ~8MB array
    "small_data": [1, 2, 3, 4, 5],
}

redacted = engine.process_object(data)
print(redacted["large_array"])
# "<LARGE_OBJECT_REDACTED: ndarray, ~8,000,000 bytes>"

Audit Trail & Compliance¶

Enable comprehensive logging for compliance requirements:

engine = RedactionEngine(
    redact_fields=["*.password", "*.ssn"],
    audit_trail=True,
    include_redaction_summary=True,
)

data = {
    "user": {
        "name": "John Doe",
        "password": "secret123",
        "ssn": "123-45-6789"
    }
}

redacted = engine.process_object(data)

# Get redaction summary
summary = engine.get_redaction_summary()
print(summary)
# {
#     "redaction_summary": {
#         "fields_redacted": ["user.password", "user.ssn"],
#         "patterns_matched": [],
#         "large_objects_redacted": [],
#         "total_redactions": 2,
#         "redaction_timestamp": "2024-01-15T10:30:00.000000+00:00"
#     }
# }

# Get audit trail
audit_trail = engine.get_audit_trail()
for entry in audit_trail:
    print(f"{entry['timestamp']}: Redacted {entry['target']} ({entry['redaction_type']})")

Integration with Serialization¶

Combine redaction with datason serialization:

import datason as ds
import pandas as pd

# Sensitive data
sensitive_data = {
    "customers": pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["John Doe", "Jane Smith", "Bob Johnson"],
        "email": ["john@example.com", "jane@example.com", "bob@example.com"],
        "ssn": ["123-45-6789", "987-65-4321", "555-12-3456"]
    }),
    "api_config": {
        "api_key": "secret-key-12345",
        "endpoint": "https://api.example.com"
    }
}

# Create redaction engine
engine = ds.create_financial_redaction_engine()

# Redact sensitive information
redacted_data = engine.process_object(sensitive_data)

# Serialize the redacted data
config = ds.get_api_config()
json_safe = ds.serialize(redacted_data, config=config)

Advanced Examples¶

Custom Domain Patterns¶

# Redaction for specific domains
class CompanyRedactionEngine(RedactionEngine):
    def __init__(self):
        super().__init__(
            redact_fields=[
                "*.employee_id",
                "*.salary",
                "*.performance_rating",
                "hr.*.personal_info",
                "finance.*.budget",
            ],
            redact_patterns=[
                r"\b[Ee]mp-\d{6}\b",  # Employee IDs
                r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?\b",  # Currency amounts
            ],
            audit_trail=True,
        )

engine = CompanyRedactionEngine()

Conditional Redaction¶

class ConditionalRedactionEngine(RedactionEngine):
    def __init__(self, environment="production"):
        # Only redact in production
        if environment == "production":
            super().__init__(
                redact_fields=["*.password", "*.api_key"],
                audit_trail=True,
            )
        else:
            # Development - minimal redaction
            super().__init__(
                redact_fields=["*.password"],
                audit_trail=False,
            )

# Usage
prod_engine = ConditionalRedactionEngine("production")
dev_engine = ConditionalRedactionEngine("development")

Best Practices¶

1. Start with Pre-built Engines¶

Use domain-specific engines as starting points:

# Good: Start with proven configuration
engine = ds.create_financial_redaction_engine()

# Then customize if needed
engine.redact_fields.extend(["*.internal_id", "*.customer_segment"])

2. Test Redaction Patterns¶

Always test your patterns with sample data:

# Test patterns before deployment
test_data = {
    "test_email": "test@example.com",
    "test_ssn": "123-45-6789",
    "test_credit": "4532-1234-5678-9012"
}

redacted = engine.process_object(test_data)
print("Redaction test:", redacted)

3. Monitor Redaction Performance¶

Large objects can impact performance:

import time

start_time = time.time()
redacted = engine.process_object(large_data)
redaction_time = time.time() - start_time

summary = engine.get_redaction_summary()
print(f"Redacted {summary['redaction_summary']['total_redactions']} items in {redaction_time:.2f}s")

4. Compliance Documentation¶

Maintain documentation for compliance:

# Document your redaction strategy
redaction_policy = {
    "description": "Customer data redaction for ML training",
    "fields_protected": ["email", "ssn", "credit_card"],
    "compliance_standards": ["GDPR", "CCPA"],
    "retention_policy": "Audit logs kept for 7 years",
    "engine_config": "create_financial_redaction_engine()"
}

Compliance Standards¶

The redaction engine helps meet various compliance requirements:

Standard	Focus Area	Supported Features
GDPR	Personal data protection	Field redaction, audit trails, data minimization
HIPAA	Healthcare data	PHI detection, audit logging, access controls
PCI-DSS	Payment card data	Credit card detection, secure handling
CCPA	California privacy	Personal information redaction
SOX	Financial reporting	Data integrity, audit trails

Performance Considerations¶

Pattern Complexity: Complex regex patterns can impact performance
Large Objects: Enable large object detection for memory protection
Audit Trails: Add minimal overhead but provide compliance value
Field Patterns: Wildcard patterns are efficient for nested structures

Error Handling¶

try:
    redacted = engine.process_object(data)
except Exception as e:
    print(f"Redaction error: {e}")
    # Fallback: redact everything or fail safe
    redacted = {"error": "Data redacted due to processing error"}

API Reference¶

See the API Overview for complete function documentation and parameters.