Skip to content

Deserialization & Type Support

Complete guide to datason's deserialization capabilities, supported types, and type preservation strategies.

๐ŸŽฏ Overview

Datason provides sophisticated deserialization with automatic type detection and optional type metadata for perfect round-trip fidelity. This page documents all supported types and their behavior with and without type hints.

๐Ÿ“Š Current Round-Trip Audit Status (v0.7.0)

Overall Success Rate: 75.0% (51/68 tests passed) โฌ†๏ธ Recently improved from 67.6%

Success Rate by Category:

  • Basic Types: 100% (20/20) โœ… Perfect! (improved from 95.0%)
  • Complex Types: 100% (15/15) โœ… Perfect! (improved from 93.3%)
  • NumPy Types: 71.4% (10/14) โš ๏ธ Needs work
  • Pandas Types: 30.8% (4/13) โŒ Critical gaps
  • ML Types: 33.3% (2/6) โš ๏ธ Improving (was 0.0%)

Recent Fixes:

  • โœ… UUID Cache Issue: Fixed cache pollution causing UUID detection to fail
  • โœ… Auto-Detection: UUID and datetime auto-detection now working reliably
  • โœ… PyTorch Tensors: Fixed tensor comparison logic in audit script (+3.0%)
  • โœ… Set/Tuple Verification: Fixed audit logic to handle expected type conversions (+2.9%)
  • โœ… Test Infrastructure: Added comprehensive enhanced type support tests

Enhancement Roadmap:

  • v0.7.5 Target: Fix critical gaps โ†’ 85%+ success rate (need +10% more)
  • v0.8.0 Target: Complete round-trip support โ†’ 95%+ success rate
  • v0.8.5 Target: Perfect ML integration โ†’ 99%+ success rate

๐Ÿš€ Performance Features (v0.6.0+)

Ultra-Fast Deserializer

from datason.deserializers import deserialize_fast

# 3.73x average performance improvement
result = deserialize_fast(data)

# With configuration for type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)

Performance Benchmarks

Data Type Improvement Use Case
Basic Types 0.84x (18% faster) Ultra-fast path
DateTime/UUID Heavy 3.49x Log processing
Large Nested 16.86x Complex data structures
Average Overall 3.73x All workloads

๐Ÿ“‹ Type Support Matrix

Types That Work WITHOUT Type Hints

These types serialize and deserialize perfectly without include_type_hints=True:

Basic JSON Types (Perfect Round-Trips)

# Perfect preservation without any configuration
none_value = None
string_value = "hello world"
unicode_string = "hello ไธ–็•Œ ๐ŸŒ"
integer_value = 42
float_value = 3.14159
boolean_value = True
list_value = [1, 2, 3]
dict_value = {"key": "value"}

# All preserve exact types and values
result = deserialize_fast(serialized_data)
Type Example Notes
None None Direct pass-through
str "hello" Unicode support
int 42 All integer sizes
float 3.14 Full precision
bool True No conversion
list [1, 2, 3] Recursive support
dict {"a": 1} String keys preserved

Auto-Detectable Types (Smart Recognition)

# Automatically detected without configuration
from datetime import datetime
import uuid

dt = datetime(2023, 1, 1, 12, 0, 0)
uid = uuid.uuid4()

# Smart pattern recognition - no type hints needed
result = deserialize_fast('2023-01-01T12:00:00')  # โ†’ datetime
result = deserialize_fast('12345678-1234-5678-9012-123456789abc')  # โ†’ UUID
Type Pattern Auto-Detection
datetime ISO format strings โœ… Full support
UUID Standard UUID format โœ… Full support
Path Absolute paths โœ… Limited support

Legacy Types (Always Get Metadata)

from decimal import Decimal

# These ALWAYS get type metadata regardless of settings
complex_num = complex(1, 2)
decimal_num = Decimal("123.45")

# Round-trip perfectly without explicit type hints
serialized = serialize(complex_num)  # Always includes metadata
result = deserialize_fast(serialized)  # โ†’ complex(1, 2)
Type Behavior Reason
complex Always preserved Legacy compatibility
Decimal Always preserved Precision requirements

Types That NEED Type Hints

These types require include_type_hints=True for perfect round-trip preservation:

Container Types (Lose Type Information)

config = SerializationConfig(include_type_hints=True)

# Without type hints: tuple โ†’ list, set โ†’ list
original_tuple = (1, 2, 3)
original_set = {1, 2, 3}

# With type hints: perfect preservation
serialized = serialize(original_tuple, config=config)
result = deserialize_fast(serialized, config=config)
assert type(result) is tuple  # โœ… Preserved
Type Without Hints With Hints Notes
tuple โ†’ list โœ… tuple Order preserved
set โ†’ list โœ… set Order may change
frozenset โ†’ list โœ… frozenset Immutability preserved

NumPy Types (Become Python Primitives)

import numpy as np

config = SerializationConfig(include_type_hints=True)

# NumPy scalars
np_int = np.int32(42)
np_float = np.float64(3.14)
np_bool = np.bool_(True)

# Arrays
np_array = np.array([[1, 2], [3, 4]])
np_zeros = np.zeros((3, 3))
NumPy Type Without Hints With Hints Notes
np.int* โ†’ int โœ… Exact type All integer types
np.float* โ†’ float โœ… Exact type Precision preserved
np.bool_ โ†’ bool โœ… np.bool_ Type distinction
np.array โ†’ list โœ… np.array Shape + dtype preserved
np.complex* โ†’ str/complex โœ… Exact type Complex number handling

Pandas Types (Become Lists/Dicts)

import pandas as pd

config = SerializationConfig(include_type_hints=True)

# DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Series  
series = pd.Series([1, 2, 3], name="my_series")

# Categorical data
categorical = pd.Series(pd.Categorical(['a', 'b', 'a', 'c']))
Pandas Type Without Hints With Hints Notes
DataFrame โ†’ list (records) โœ… DataFrame Index + columns preserved
Series โ†’ dict โœ… Series Index + name preserved
Categorical โ†’ dict (as object) โœ… Categorical Categories + ordered preserved

Machine Learning Types (Become Dicts)

import torch
from sklearn.linear_model import LogisticRegression

config = SerializationConfig(include_type_hints=True)

# PyTorch tensors
tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Sklearn models  
model = LogisticRegression(random_state=42)
ML Type Without Hints With Hints Notes
torch.Tensor โ†’ list โœ… torch.Tensor Device + dtype preserved
sklearn.* โ†’ dict โœ… Model class Parameters preserved*

*Note: Sklearn fitted state (weights) cannot be preserved without training data.

๐Ÿ”ง Configuration Options

Basic Deserialization

from datason.deserializers import deserialize_fast

# Default behavior - auto-detection only
result = deserialize_fast(data)

Type Preservation

from datason.config import SerializationConfig

# Full type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)

Security Limits

# Custom security limits
config = SerializationConfig(
    max_depth=50,        # Maximum nesting depth
    max_size=100_000,    # Maximum container size
    include_type_hints=True
)
result = deserialize_fast(data, config=config)

DataFrame Orientations

from datason.config import DataFrameOrient

# Different DataFrame serialization formats
config = SerializationConfig(
    dataframe_orient=DataFrameOrient.RECORDS,  # Default
    include_type_hints=True
)

# Available orientations:
# - RECORDS: [{"a": 1, "b": 4}, {"a": 2, "b": 5}]
# - SPLIT: {"index": [...], "columns": [...], "data": [...]}
# - INDEX: {0: {"a": 1, "b": 4}, 1: {"a": 2, "b": 5}}
# - VALUES: [[1, 4], [2, 5]]  # Loses column names

๐ŸŽจ Advanced Features

Automatic String Type Detection

# Smart pattern recognition without type hints
datetime_str = "2023-01-01T12:00:00"
uuid_str = "12345678-1234-5678-9012-123456789abc"
path_str = "/tmp/test/file.txt"

# Automatically converted to appropriate types
dt = deserialize_fast(datetime_str)    # โ†’ datetime object
uid = deserialize_fast(uuid_str)       # โ†’ UUID object
path = deserialize_fast(path_str)      # โ†’ Path object (if absolute)

Type Metadata Format

# New format (v0.6.0+)
{
    "__datason_type__": "numpy.int32",
    "__datason_value__": 42
}

# Legacy format (still supported)
{
    "_type": "numpy.int32",
    "value": 42
}

Complex Type Reconstruction

# Complex numbers - automatic detection
complex_dict = {"real": 1.0, "imag": 2.0}
result = deserialize_fast(complex_dict)  # โ†’ complex(1, 2)

# Decimal from string pattern
decimal_dict = {"value": "123.456"}
result = deserialize_fast(decimal_dict)  # โ†’ Decimal("123.456")

๐Ÿ›ก๏ธ Security Features

Depth Protection

# Protects against deeply nested attacks
config = SerializationConfig(max_depth=5)

deep_data = {"a": {"b": {"c": {"d": {"e": {"f": "too deep"}}}}}}
# Raises DeserializationSecurityError when depth > 5

Size Limits

# Protects against memory exhaustion
config = SerializationConfig(max_size=1000)

huge_dict = {f"key_{i}": i for i in range(10000)}
# Raises DeserializationSecurityError when size > 1000

Circular Reference Detection

# Safe handling of circular references
circular_data = {"self": None}
circular_data["self"] = circular_data

result = deserialize_fast(circular_data)
# Returns: {"self": {}}  # Circular reference safely broken

๐Ÿ“Š Performance Optimization

Ultra-Fast Basic Types

# Zero-overhead processing for common cases
result = deserialize_fast(42)         # Immediate return
result = deserialize_fast("hello")    # < 8 chars = immediate return
result = deserialize_fast(True)       # Immediate return
result = deserialize_fast(None)       # Immediate return

Caching Systems

New in v0.7.0: Configurable caching system with multiple scopes for different workflows.

import datason
from datason import CacheScope

# Choose cache scope based on your needs
datason.set_cache_scope(CacheScope.REQUEST)   # For web APIs
datason.set_cache_scope(CacheScope.PROCESS)   # For ML training

# Pattern caching for repeated strings (automatic)
datetime_str = "2023-01-01T12:00:00"
result1 = datason.deserialize_fast(datetime_str)  # Parse and cache
result2 = datason.deserialize_fast(datetime_str)  # Cache hit! (faster)

# Object caching for parsed results (automatic)
uuid_str = "12345678-1234-5678-9012-123456789abc"
result1 = datason.deserialize_fast(uuid_str)   # Parse and cache UUID object
result2 = datason.deserialize_fast(uuid_str)   # Return cached UUID object

See the Caching Documentation for detailed configuration and performance tuning.

Memory Pooling

# Container pooling for reduced allocations
# Automatically managed - no user configuration needed
large_nested = {"data": [{"item": i} for i in range(1000)]}
result = deserialize_fast(large_nested)  # Uses pooled containers internally

๐Ÿงช Testing & Validation

Round-Trip Testing

def test_round_trip(original_data):
    """Test perfect round-trip preservation"""
    config = SerializationConfig(include_type_hints=True)

    # Serialize
    serialized = serialize(original_data, config=config)

    # Ensure JSON compatibility
    json_str = json.dumps(serialized)
    parsed = json.loads(json_str)

    # Deserialize
    deserialized = deserialize_fast(parsed, config=config)

    # Validate
    assert type(deserialized) is type(original_data)
    assert deserialized == original_data

Type Validation

# Comprehensive type support validation
test_cases = [
    (42, int),                           # Basic int
    (np.int32(42), np.int32),           # NumPy int32
    ((1, 2, 3), tuple),                 # Tuple preservation
    (pd.Series([1, 2, 3]), pd.Series),  # Pandas Series
    (torch.tensor([1, 2]), torch.Tensor), # PyTorch tensor
]

config = SerializationConfig(include_type_hints=True)
for original, expected_type in test_cases:
    serialized = serialize(original, config=config)
    result = deserialize_fast(serialized, config=config)
    assert type(result) is expected_type

๐Ÿ”„ Migration Guide

From v0.5.x to v0.6.0

# Old approach
from datason.deserializers import deserialize
result = deserialize(data)

# New optimized approach (3.73x faster)
from datason.deserializers import deserialize_fast
result = deserialize_fast(data)  # Drop-in replacement

# Enhanced type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)

Performance Migration

# Replace slow patterns
for item in large_dataset:
    result = deserialize(item)  # Slow

# With optimized batch processing
results = [deserialize_fast(item) for item in large_dataset]  # Fast

๐Ÿ“ˆ Best Practices

Type Preservation Strategy

# For maximum compatibility: include type hints
config = SerializationConfig(include_type_hints=True)

# For maximum performance: rely on auto-detection
# (only for basic types + datetime/UUID)
result = deserialize_fast(data)  # No config needed

Error Handling

from datason.deserializers import DeserializationSecurityError

try:
    result = deserialize_fast(untrusted_data, config=config)
except DeserializationSecurityError as e:
    logger.warning(f"Security violation in deserialization: {e}")
    # Handle security issue appropriately
except Exception as e:
    logger.error(f"Deserialization failed: {e}")
    # Handle other errors

Production Monitoring

import warnings

# Monitor security warnings in production
with warnings.catch_warnings(record=True) as w:
    result = deserialize_fast(data)
    for warning in w:
        if "circular reference" in str(warning.message).lower():
            logger.info(f"Circular reference detected and handled: {warning.message}")

This comprehensive deserialization system provides both high performance and complete type safety, making it suitable for everything from high-throughput data processing to complex scientific computing workflows.