datason Features Overview¶
datason provides intelligent serialization through a layered architecture of features, from core JSON compatibility to advanced ML/AI object handling and configurable behavior.
🎯 Feature Categories¶
Core Serialization¶
The foundation layer providing basic JSON compatibility and safety features.
- Basic Types: `str`, `int`, `float`, `bool`, `None`, `list`, `dict`
- Security: Circular reference detection, depth limits, size limits
- Performance: Optimization for already-serialized data
- Error Handling: Graceful fallbacks for unsupported types
Deserialization & Type Support 🆕 v0.6.0¶
Ultra-fast deserialization with comprehensive type support and intelligent auto-detection.
- Performance: 3.73x average improvement, 16.86x on large nested data
- Type Matrix: Complete documentation of 133+ supported types
- Auto-Detection: Smart recognition of datetime, UUID, and numeric patterns
- Type Preservation: Optional metadata for perfect round-trip fidelity
- Security: Depth/size limits with zero performance impact
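The auto-detection behavior pairs with `deserialize_fast`, which also appears in the caching example later on this page. A minimal sketch, assuming default settings; the exact types you get back depend on whether type metadata was included during serialization:

```python
import uuid
from datetime import datetime

import datason

original = {"id": uuid.uuid4(), "created": datetime.now(), "scores": [1.5, 2.5]}
payload = datason.serialize(original)

# ISO-datetime and UUID string patterns are recognized and converted
# back into datetime/UUID objects during deserialization.
restored = datason.deserialize_fast(payload)
print(type(restored["created"]), type(restored["id"]))
```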
Advanced Types¶
Extended support for Python's rich type system and specialized objects.
- Built-in Types: `complex`, `decimal.Decimal`, `uuid.UUID`, `pathlib.Path`
- Collections: `set`, `frozenset`, `namedtuple`, `range`, `bytes`
- Enums: Support for `enum.Enum` and custom enumeration classes
- Type Coercion: Configurable strategies from strict to aggressive
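All of these go through the same `serialize()` call. A minimal sketch; the exact JSON-compatible form of each value (string, list, or number) depends on the active type-coercion settings:

```python
import uuid
from decimal import Decimal
from pathlib import Path

import datason

data = {
    "request_id": uuid.uuid4(),            # uuid.UUID
    "price": Decimal("19.99"),             # decimal.Decimal
    "config_path": Path("/etc/app.yaml"),  # pathlib.Path
    "tags": {"alpha", "beta"},             # set
    "window": range(0, 10, 2),             # range
    "blob": b"\x00\x01",                   # bytes
}

result = datason.serialize(data)  # every value becomes JSON-compatible
```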
Date/Time Handling¶
Comprehensive support for temporal data with timezone awareness.
- Formats: ISO, Unix timestamp, Unix milliseconds, custom patterns
- Types: `datetime`, `date`, `time`, `timedelta`
- Pandas Integration: `pd.Timestamp`, `pd.NaT`, `pd.DatetimeIndex`
- Timezone Support: Aware and naive datetime handling
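A minimal sketch of switching the output format, reusing the `SerializationConfig` and `DateFormat` options shown in the enterprise example further down this page:

```python
from datetime import datetime, timezone

import datason
from datason.config import SerializationConfig, DateFormat

config = SerializationConfig(date_format=DateFormat.UNIX_MS)

event = {"created": datetime(2024, 1, 1, tzinfo=timezone.utc)}

# With UNIX_MS the datetime is emitted as milliseconds since the epoch
# rather than the default ISO 8601 string.
result = datason.serialize(event, config=config)
```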
File Operations 🆕 v0.11.0¶
Complete JSON/JSONL file I/O integrated as first-class citizens in the modern API.
- Dual Format Support: Both JSON (.json) and JSONL (.jsonl) with auto-detection
- Progressive API: Basic → Smart → Perfect loading complexity
- Full Integration: All datason features work with files (ML, security, streaming, compression)
- Domain-Specific: Specialized `save_ml()`, `save_secure()`, `save_api()` functions
- Auto-Compression: Automatic .gz compression detection and handling
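A sketch of saving to both formats. Only the function names `save_ml()`, `save_secure()`, and `save_api()` come from the list above; the `(obj, path)` argument order shown here is an assumption, so check the file-operations reference for exact signatures:

```python
import datason

records = [{"epoch": i, "loss": 0.1 * i} for i in range(3)]

# Format is inferred from the extension (.json vs .jsonl),
# and a trailing .gz enables automatic compression.
# NOTE: the (obj, path) argument order is assumed for illustration.
datason.save_ml(records, "runs/metrics.jsonl")
datason.save_ml(records, "runs/metrics.json.gz")
```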
Chunked Processing & Streaming 🆕 v0.4.0¶
Memory-efficient handling of large datasets that exceed available RAM.
- Chunked Serialization: Break large objects into manageable pieces
- Streaming: Continuous data writing without memory accumulation
- Memory Estimation: Automatic optimization recommendations
- File Formats: JSONL and JSON array support for chunked data
Template-Based Deserialization 🆕 v0.4.5¶
Type-guided reconstruction with ML-optimized round-trip fidelity.
- Template Guidance: Use reference objects to ensure consistent types
- Auto-Inference: Generate templates from sample data
- ML Templates: Specialized templates for machine learning workflows
- Type Validation: Consistent data structure validation
Data Utilities with Security Patterns 🆕 v0.5.5¶
Comprehensive data analysis and transformation tools with consistent security protection.
- Deep Comparison: Advanced object comparison with tolerance and security limits
- Anomaly Detection: Identify large strings, collections, and suspicious patterns
- Type Enhancement: Smart type inference and conversion with safety checks
- Structure Normalization: Flatten or transform data structures securely
- Datetime Processing: Standardize formats and extract temporal features
- Pandas/NumPy Integration: Enhanced DataFrame and array processing with limits
- Configurable Security: Environment-specific configurations for different trust levels
Data Integrity & Verification 🆕 v0.10.0¶
Cryptographic hashing and signature utilities for tamper detection.
- Canonicalization: Deterministic JSON for stable hashing
- Hash & Verify: Object and JSON hashing with algorithm validation
- Digital Signatures: Ed25519 sign/verify support
- Redaction Integration: Optional PII removal before hashing
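The canonicalization idea can be illustrated with the standard library alone, without assuming specific helper names; the built-in integrity utilities described above add algorithm validation, Ed25519 signing, and optional redaction:

```python
import hashlib
import json

import datason

record = {"model": "fraud-v3", "threshold": 0.87}

# 1. Convert to JSON-compatible data.
payload = datason.serialize(record)

# 2. Canonicalize: sorted keys + compact separators give a stable byte string.
canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()

# 3. Hash for later tamper detection.
digest = hashlib.sha256(canonical).hexdigest()
```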
ML/AI Integration¶
Native support for machine learning and scientific computing objects.
- PyTorch: Tensors, models, parameters
- TensorFlow: Tensors, variables, SavedModel metadata
- Scikit-learn: Fitted models, pipelines, transformers
- NumPy: Arrays, scalars, dtypes
- JAX: Arrays and computation graphs
- PIL/Pillow: Images with format preservation
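A minimal NumPy-only sketch using the ML preset (the same call handles the other frameworks listed above when they are installed); the comments describe the typical output, which can vary with configuration:

```python
import numpy as np

import datason
from datason.config import get_ml_config

batch = {
    "weights": np.random.rand(4, 4),    # ndarray -> nested lists (plus optional dtype metadata)
    "learning_rate": np.float32(1e-3),  # NumPy scalar -> plain Python float
}

result = datason.serialize(batch, config=get_ml_config())
```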
Model Serving Integration¶
Guides for BentoML, Ray Serve, Streamlit, Gradio, MLflow, and Seldon/KServe.
Pandas Integration¶
Deep integration with the pandas ecosystem for data science workflows.
- DataFrames: Configurable orientation (records, split, index, columns, values, table)
- Series: Index preservation and metadata handling
- Index Types: RangeIndex, DatetimeIndex, MultiIndex
- Categorical: Category metadata and ordering
- NaN Handling: Configurable strategies for missing data
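A minimal sketch; by default a DataFrame is typically emitted as row records, and the orientation is selectable through the configuration system described below:

```python
import pandas as pd

import datason

df = pd.DataFrame({"city": ["Oslo", "Lima"], "temp_c": [3.5, 24.0]})

# DataFrames can be nested inside ordinary structures; the chosen
# orientation (records, split, values, ...) comes from the active config.
result = datason.serialize({"observations": df, "count": len(df)})
```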
Configuration System¶
Fine-grained control over serialization behavior with preset configurations.
- Presets: ML, API, Strict, Performance optimized configurations
- Domain-Specific: Financial, Time Series, Inference, Research, Logging presets
- Date Formats: 5 different datetime serialization formats
- NaN Handling: 4 strategies for missing/null values
- Type Coercion: 3 levels from strict type preservation to aggressive conversion
- Custom Serializers: Register handlers for custom types
Pickle Bridge¶
Secure migration of legacy ML pickle files to portable JSON format.
- Security-First: Class whitelisting prevents arbitrary code execution
- Zero Dependencies: Uses only Python standard library
- ML Coverage: 54 safe classes covering 95%+ of common pickle files
- Bulk Processing: Directory-level conversion with statistics tracking
- Production Ready: File size limits, error handling, monitoring
Configurable Caching System 🆕 v0.7.0¶
Intelligent caching that adapts to different workflow requirements with multiple cache scopes.
- Multiple Scopes: Operation, Request, Process, and Disabled caching modes
- Performance: 50-200% speed improvements for repeated operations
- ML-Optimized: Perfect for training loops and data analytics
- Context Managers: Easy scope management and isolation
- Metrics & Monitoring: Built-in cache performance tracking
- Object Pooling: Memory-efficient object reuse with automatic cleanup
Performance Features¶
Optimizations for speed and memory efficiency in production environments.
- Early Detection: Skip processing for JSON-compatible data
- Memory Streaming: Handle large datasets without full memory loading
- Configurable Limits: Prevent resource exhaustion attacks
- Benchmarking: Built-in performance measurement tools
- Intelligent Caching: Context-aware caching for maximum performance
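A small sketch of opting into the performance-oriented preset. `get_performance_config()` is assumed here by analogy with `get_ml_config()` used later on this page; verify the exact accessor name in the configuration reference:

```python
import datason
# NOTE: get_performance_config() is an assumed preset accessor,
# named by analogy with get_ml_config(); confirm before use.
from datason.config import get_performance_config

payload = {"ids": list(range(1_000)), "ok": True}

# Already-JSON-compatible data is detected early and passed through
# with minimal extra processing.
result = datason.serialize(payload, config=get_performance_config())
```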
🚀 Quick Feature Matrix¶
Feature Category | Basic | Advanced | Enterprise |
---|---|---|---|
Core Types | ✅ JSON types | ✅ + Python types | ✅ + Custom types |
Large Data | ❌ | ✅ Chunked/Streaming | ✅ + Memory optimization |
Type Safety | ❌ | ✅ Template validation | ✅ + ML round-trip |
ML/AI Objects | ❌ | ✅ Common libraries | ✅ + Custom models |
Configuration | ❌ | ✅ Presets | ✅ + Full control |
Pickle Bridge | ❌ | ✅ Safe conversion | ✅ + Bulk migration |
Performance | ✅ Basic | ✅ Optimized | ✅ + Monitoring |
Caching | ❌ | ✅ Operation scope | ✅ + All scopes + Metrics |
Data Science | ❌ | ✅ Pandas/NumPy | ✅ + Advanced |
Data Utilities | ❌ | ✅ Basic tools | ✅ + Security patterns |
📖 Usage Patterns¶
Simple Usage (Core Features)¶
import datason
from datetime import datetime

# Works out of the box
data = {"users": [1, 2, 3], "timestamp": datetime.now()}
result = datason.serialize(data)
Data Analysis & Transformation (v0.5.5)¶
import datason
# Compare complex data structures with tolerance
obj1 = {"users": [{"score": 85.5}], "metadata": {"version": "1.0"}}
obj2 = {"users": [{"score": 85.6}], "metadata": {"version": "1.0"}}
comparison = datason.deep_compare(obj1, obj2, tolerance=1e-1)
# Detect anomalies and security issues
messy_data = {"large_text": "x" * 50000, "items": list(range(5000))}
anomalies = datason.find_data_anomalies(messy_data)
# Smart type enhancement with security
raw_data = {"id": "123", "score": "85.5", "active": "true"}
enhanced, report = datason.enhance_data_types(raw_data)
# enhanced["id"] is now int(123), not string
# Configurable security for different environments
from datason.utils import UtilityConfig
api_config = UtilityConfig(max_depth=10, max_object_size=10_000)
result = datason.find_data_anomalies(untrusted_data, config=api_config)
Large Data Processing (v0.4.0)¶
import datason
# Memory-efficient processing of large datasets
large_data = create_huge_dataset() # Multi-GB dataset
# Process in chunks without memory overflow
result = datason.serialize_chunked(large_data, chunk_size=10000)
# Stream data to file
with datason.stream_serialize("output.jsonl") as stream:
    for item in continuous_data_source():
        stream.write(item)
Type-Safe Deserialization (v0.4.5)¶
import datason
from datetime import datetime
from datason.deserializers import deserialize_with_template
# Ensure consistent types with templates
template = {"user_id": 0, "created": datetime.now(), "active": True}
data = {"user_id": "123", "created": "2023-01-01T10:00:00", "active": "true"}
# Template ensures proper type conversion
result = deserialize_with_template(data, template)
# result["user_id"] is int(123), not string
Configured Usage (Advanced Features)¶
import datason
from datason.config import get_ml_config
# Optimized for ML workflows
config = get_ml_config()
result = datason.serialize(ml_data, config=config)
High-Performance Caching (v0.7.0)¶
import datason
from datason import CacheScope
# ML training with maximum performance
datason.set_cache_scope(CacheScope.PROCESS)
for epoch in range(num_epochs):
    for batch in training_data:
        # Repeated datetime/UUID patterns cached automatically
        parsed_batch = datason.deserialize_fast(batch)  # 150-200% faster!
        train_step(parsed_batch)
# Web API with request-scoped caching
def api_handler(request_data):
    with datason.request_scope():
        # Cache shared within request, cleared between requests
        return process_api_request(request_data)
# Monitor cache performance
metrics = datason.get_cache_metrics()
print(f"Cache hit rate: {metrics[CacheScope.PROCESS].hit_rate:.1%}")
Custom Usage (Enterprise Features)¶
import datason
from datason.config import SerializationConfig, DateFormat, TypeCoercion
# Full control over behavior
config = SerializationConfig(
    date_format=DateFormat.UNIX_MS,
    type_coercion=TypeCoercion.AGGRESSIVE,
    preserve_decimals=True,
    custom_serializers={MyClass: my_serializer}
)
result = datason.serialize(data, config=config)
Pickle Bridge Usage (ML Migration)¶
import datason
# Convert legacy pickle files safely
result = datason.from_pickle("legacy_model.pkl")
# Bulk migration with security controls
stats = datason.convert_pickle_directory(
    source_dir="old_models/",
    target_dir="json_models/",
    safe_classes=datason.get_ml_safe_classes()
)
🛣️ Feature Roadmap¶
✅ Available Now¶
- Core serialization with safety features
- Advanced Python type support
- ML/AI object integration
- Configuration system with presets
- Pandas deep integration
- Performance optimizations
- Pickle Bridge for legacy ML migration
- 🆕 Chunked processing & streaming (v0.4.0)
- 🆕 Template-based deserialization (v0.4.5)
- 🆕 Data utilities with security patterns (v0.5.5)
🔄 In Development¶
- Schema validation
- Compression support
- Plugin architecture
- Type hints integration
🔮 Planned¶
- GraphQL integration
- Protocol Buffers support
- Arrow format compatibility
- Cloud storage adapters
- Real-time synchronization
📚 Learn More¶
Each feature category has detailed documentation with examples, best practices, and performance considerations:
- Core Serialization → Start here for basic usage
- Chunked Processing → 🆕 Handle large datasets efficiently
- Template Deserialization → 🆕 Type-safe reconstruction
- Data Utilities → 🆕 Analysis & transformation with security
- Configuration System → Control serialization behavior
- ML/AI Integration → Work with ML frameworks
- Performance Guide → Optimize for production
- Migration Guide → Upgrade from other serializers