# Deserialization & Type Support

Complete guide to datason's deserialization capabilities, supported types, and type preservation strategies.

## Overview

Datason provides sophisticated deserialization with automatic type detection and optional type metadata for perfect round-trip fidelity. This page documents all supported types and their behavior with and without type hints.
## Current Round-Trip Audit Status (v0.7.0)

Overall Success Rate: 75.0% (51/68 tests passed) ⬆️ recently improved from 67.6%

### Success Rate by Category

- Basic Types: 100% (20/20) ✅ Perfect! (improved from 95.0%)
- Complex Types: 100% (15/15) ✅ Perfect! (improved from 93.3%)
- NumPy Types: 71.4% (10/14) ⚠️ Needs work
- Pandas Types: 30.8% (4/13) ❌ Critical gaps
- ML Types: 33.3% (2/6) ⚠️ Improving (was 0.0%)
### Recent Fixes

- ✅ UUID Cache Issue: Fixed cache pollution causing UUID detection to fail
- ✅ Auto-Detection: UUID and datetime auto-detection now working reliably
- ✅ PyTorch Tensors: Fixed tensor comparison logic in audit script (+3.0%)
- ✅ Set/Tuple Verification: Fixed audit logic to handle expected type conversions (+2.9%)
- ✅ Test Infrastructure: Added comprehensive enhanced type support tests

### Enhancement Roadmap

- v0.7.5 Target: Fix critical gaps → 85%+ success rate (need +10% more)
- v0.8.0 Target: Complete round-trip support → 95%+ success rate
- v0.8.5 Target: Perfect ML integration → 99%+ success rate
## Performance Features (v0.6.0+)

### Ultra-Fast Deserializer

```python
from datason.config import SerializationConfig
from datason.deserializers import deserialize_fast

# 3.73x average performance improvement
result = deserialize_fast(data)

# With configuration for type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)
```
### Performance Benchmarks

| Data Type | Improvement | Use Case |
|---|---|---|
| Basic Types | 0.84x (18% faster) | Ultra-fast path |
| DateTime/UUID Heavy | 3.49x | Log processing |
| Large Nested | 16.86x | Complex data structures |
| Average Overall | 3.73x | All workloads |
## Type Support Matrix

### Types That Work WITHOUT Type Hints

These types serialize and deserialize perfectly without `include_type_hints=True`:

#### Basic JSON Types (Perfect Round-Trips)

```python
# Perfect preservation without any configuration
none_value = None
string_value = "hello world"
unicode_string = "hello 世界 🌍"  # full Unicode support
integer_value = 42
float_value = 3.14159
boolean_value = True
list_value = [1, 2, 3]
dict_value = {"key": "value"}

# All preserve exact types and values
result = deserialize_fast(serialized_data)
```
| Type | Example | Notes |
|---|---|---|
| `None` | `None` | Direct pass-through |
| `str` | `"hello"` | Unicode support |
| `int` | `42` | All integer sizes |
| `float` | `3.14` | Full precision |
| `bool` | `True` | No conversion |
| `list` | `[1, 2, 3]` | Recursive support |
| `dict` | `{"a": 1}` | String keys preserved |
#### Auto-Detectable Types (Smart Recognition)

```python
# Automatically detected without configuration
from datetime import datetime
import uuid

dt = datetime(2023, 1, 1, 12, 0, 0)
uid = uuid.uuid4()

# Smart pattern recognition - no type hints needed
result = deserialize_fast('2023-01-01T12:00:00')  # → datetime
result = deserialize_fast('12345678-1234-5678-9012-123456789abc')  # → UUID
```
| Type | Pattern | Auto-Detection |
|---|---|---|
| `datetime` | ISO format strings | ✅ Full support |
| `UUID` | Standard UUID format | ✅ Full support |
| `Path` | Absolute paths | ⚠️ Limited support |
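The pattern recognition above can be approximated with the standard library alone. This sketch is illustrative only (it is not datason's actual detection logic) and shows how UUID and ISO-format strings can be told apart:

```python
import uuid
from datetime import datetime

def detect_string_type(s: str):
    """Best-effort type detection for a string.

    Illustrative sketch only, not datason's implementation.
    """
    try:
        return uuid.UUID(s)  # standard 8-4-4-4-12 hex format
    except ValueError:
        pass
    try:
        return datetime.fromisoformat(s)  # ISO-8601 strings
    except ValueError:
        pass
    return s  # leave anything unrecognized untouched

print(detect_string_type("2023-01-01T12:00:00"))  # datetime(2023, 1, 1, 12, 0)
print(detect_string_type("12345678-1234-5678-9012-123456789abc"))  # UUID object
```

Trying the UUID parse first is safe here because an ISO datetime string never contains 32 hex digits, so the two patterns cannot collide.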
#### Legacy Types (Always Get Metadata)

```python
from decimal import Decimal

from datason import serialize

# These ALWAYS get type metadata regardless of settings
complex_num = complex(1, 2)
decimal_num = Decimal("123.45")

# Round-trip perfectly without explicit type hints
serialized = serialize(complex_num)    # Always includes metadata
result = deserialize_fast(serialized)  # → complex(1, 2)
```
| Type | Behavior | Reason |
|---|---|---|
| `complex` | Always preserved | Legacy compatibility |
| `Decimal` | Always preserved | Precision requirements |
### Types That NEED Type Hints

These types require `include_type_hints=True` for perfect round-trip preservation:

#### Container Types (Lose Type Information)

```python
config = SerializationConfig(include_type_hints=True)

# Without type hints: tuple → list, set → list
original_tuple = (1, 2, 3)
original_set = {1, 2, 3}

# With type hints: perfect preservation
serialized = serialize(original_tuple, config=config)
result = deserialize_fast(serialized, config=config)
assert type(result) is tuple  # ✅ Preserved
```
| Type | Without Hints | With Hints | Notes |
|---|---|---|---|
| `tuple` | → `list` | ✅ `tuple` | Order preserved |
| `set` | → `list` | ✅ `set` | Order may change |
| `frozenset` | → `list` | ✅ `frozenset` | Immutability preserved |
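The loss described in this table is inherent to JSON itself rather than specific to datason, which is exactly why the extra metadata is needed. A quick stdlib demonstration:

```python
import json

# JSON has no tuple type, so a tuple flattens to an array
original_tuple = (1, 2, 3)
round_tripped = json.loads(json.dumps(original_tuple))
print(type(round_tripped).__name__)  # list, not tuple

# Sets are not JSON-serializable at all without converting them first
round_tripped_set = json.loads(json.dumps(sorted({3, 1, 2})))
print(round_tripped_set)  # [1, 2, 3] as a plain list
```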
#### NumPy Types (Become Python Primitives)

```python
import numpy as np

config = SerializationConfig(include_type_hints=True)

# NumPy scalars
np_int = np.int32(42)
np_float = np.float64(3.14)
np_bool = np.bool_(True)

# Arrays
np_array = np.array([[1, 2], [3, 4]])
np_zeros = np.zeros((3, 3))
```
| NumPy Type | Without Hints | With Hints | Notes |
|---|---|---|---|
| `np.int*` | → `int` | ✅ Exact type | All integer types |
| `np.float*` | → `float` | ✅ Exact type | Precision preserved |
| `np.bool_` | → `bool` | ✅ `np.bool_` | Type distinction |
| `np.array` | → `list` | ✅ `np.array` | Shape + dtype preserved |
| `np.complex*` | → `str`/`complex` | ✅ Exact type | Complex number handling |
#### Pandas Types (Become Lists/Dicts)

```python
import pandas as pd

config = SerializationConfig(include_type_hints=True)

# DataFrame
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Series
series = pd.Series([1, 2, 3], name="my_series")

# Categorical data
categorical = pd.Series(pd.Categorical(['a', 'b', 'a', 'c']))
```
| Pandas Type | Without Hints | With Hints | Notes |
|---|---|---|---|
| `DataFrame` | → `list` (records) | ✅ `DataFrame` | Index + columns preserved |
| `Series` | → `dict` | ✅ `Series` | Index + name preserved |
| `Categorical` | → `dict` (as object) | ✅ `Categorical` | Categories + ordered preserved |
#### Machine Learning Types (Become Dicts)

```python
import torch
from sklearn.linear_model import LogisticRegression

config = SerializationConfig(include_type_hints=True)

# PyTorch tensors
tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Sklearn models
model = LogisticRegression(random_state=42)
```
| ML Type | Without Hints | With Hints | Notes |
|---|---|---|---|
| `torch.Tensor` | → `list` | ✅ `torch.Tensor` | Device + dtype preserved |
| `sklearn.*` | → `dict` | ✅ Model class | Parameters preserved* |

*Note: Sklearn fitted state (weights) cannot be preserved without training data.
## Configuration Options

### Basic Deserialization

```python
from datason.deserializers import deserialize_fast

# Default behavior - auto-detection only
result = deserialize_fast(data)
```
### Type Preservation

```python
from datason.config import SerializationConfig

# Full type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)
```
### Security Limits

```python
# Custom security limits
config = SerializationConfig(
    max_depth=50,      # Maximum nesting depth
    max_size=100_000,  # Maximum container size
    include_type_hints=True
)
result = deserialize_fast(data, config=config)
```
### DataFrame Orientations

```python
from datason.config import DataFrameOrient

# Different DataFrame serialization formats
config = SerializationConfig(
    dataframe_orient=DataFrameOrient.RECORDS,  # Default
    include_type_hints=True
)

# Available orientations:
# - RECORDS: [{"a": 1, "b": 4}, {"a": 2, "b": 5}]
# - SPLIT:   {"index": [...], "columns": [...], "data": [...]}
# - INDEX:   {0: {"a": 1, "b": 4}, 1: {"a": 2, "b": 5}}
# - VALUES:  [[1, 4], [2, 5]]  # Loses column names
```
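To make the shapes concrete, here is a small stdlib-only sketch (a hypothetical helper, not part of datason or pandas) that converts a records-oriented table into the split orientation:

```python
def records_to_split(records):
    """Convert [{"a": 1, "b": 4}, ...] into the split orientation.

    Illustrative helper; assumes all records share the same keys.
    """
    columns = list(records[0]) if records else []
    return {
        "index": list(range(len(records))),
        "columns": columns,
        "data": [[row[col] for col in columns] for row in records],
    }

records = [{"a": 1, "b": 4}, {"a": 2, "b": 5}]
print(records_to_split(records))
# {'index': [0, 1], 'columns': ['a', 'b'], 'data': [[1, 4], [2, 5]]}
```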
## Advanced Features

### Automatic String Type Detection

```python
# Smart pattern recognition without type hints
datetime_str = "2023-01-01T12:00:00"
uuid_str = "12345678-1234-5678-9012-123456789abc"
path_str = "/tmp/test/file.txt"

# Automatically converted to appropriate types
dt = deserialize_fast(datetime_str)  # → datetime object
uid = deserialize_fast(uuid_str)     # → UUID object
path = deserialize_fast(path_str)    # → Path object (if absolute)
```
### Type Metadata Format

```python
# New format (v0.6.0+)
{
    "__datason_type__": "numpy.int32",
    "__datason_value__": 42
}

# Legacy format (still supported)
{
    "_type": "numpy.int32",
    "value": 42
}
```
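As an illustration of how such markers can drive reconstruction, here is a hypothetical dispatcher (not datason's actual implementation) that handles just two stdlib types:

```python
import uuid
from datetime import datetime

# Hypothetical handlers keyed by type name; datason's real registry is richer
RECONSTRUCTORS = {
    "datetime": datetime.fromisoformat,
    "uuid.UUID": uuid.UUID,
}

def reconstruct(obj):
    """Recursively replace {"__datason_type__": ..., "__datason_value__": ...}
    markers with real Python objects."""
    if isinstance(obj, dict):
        if "__datason_type__" in obj and "__datason_value__" in obj:
            handler = RECONSTRUCTORS.get(obj["__datason_type__"])
            if handler is not None:
                return handler(obj["__datason_value__"])
            return obj["__datason_value__"]  # unknown type: best effort
        return {k: reconstruct(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [reconstruct(v) for v in obj]
    return obj

payload = {"created": {"__datason_type__": "datetime",
                       "__datason_value__": "2023-01-01T12:00:00"}}
print(reconstruct(payload))  # {'created': datetime.datetime(2023, 1, 1, 12, 0)}
```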
### Complex Type Reconstruction

```python
# Complex numbers - automatic detection
complex_dict = {"real": 1.0, "imag": 2.0}
result = deserialize_fast(complex_dict)  # → complex(1, 2)

# Decimal from string pattern
decimal_dict = {"value": "123.456"}
result = deserialize_fast(decimal_dict)  # → Decimal("123.456")
```
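The same shape-based detection can be sketched in plain Python. This is illustrative only; datason's real heuristics are more conservative about when a dict is treated as a special value:

```python
from decimal import Decimal, InvalidOperation

def reconstruct_special(d: dict):
    """Rebuild complex/Decimal values from the dict shapes shown above."""
    if set(d) == {"real", "imag"}:
        return complex(d["real"], d["imag"])
    if set(d) == {"value"} and isinstance(d["value"], str):
        try:
            return Decimal(d["value"])
        except InvalidOperation:
            pass  # not a numeric string: fall through
    return d

print(reconstruct_special({"real": 1.0, "imag": 2.0}))  # (1+2j)
print(reconstruct_special({"value": "123.456"}))        # 123.456 (a Decimal)
```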
## Security Features

### Depth Protection

```python
# Protects against deeply nested attacks
config = SerializationConfig(max_depth=5)
deep_data = {"a": {"b": {"c": {"d": {"e": {"f": "too deep"}}}}}}
# Raises DeserializationSecurityError when depth > 5
```
### Size Limits

```python
# Protects against memory exhaustion
config = SerializationConfig(max_size=1000)
huge_dict = {f"key_{i}": i for i in range(10000)}
# Raises DeserializationSecurityError when size > 1000
```
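A generic stdlib sketch of the kind of guard these two limits imply (again, not datason's code, and raising plain `ValueError` instead of `DeserializationSecurityError`):

```python
def check_limits(obj, max_depth=5, max_size=1000, _depth=0):
    """Raise ValueError if nesting depth or container size exceeds the limits."""
    if _depth > max_depth:
        raise ValueError(f"max depth {max_depth} exceeded")
    if isinstance(obj, dict):
        if len(obj) > max_size:
            raise ValueError(f"max size {max_size} exceeded")
        for v in obj.values():
            check_limits(v, max_depth, max_size, _depth + 1)
    elif isinstance(obj, list):
        if len(obj) > max_size:
            raise ValueError(f"max size {max_size} exceeded")
        for v in obj:
            check_limits(v, max_depth, max_size, _depth + 1)

check_limits({"a": {"b": "ok"}})  # within limits: passes silently
```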
### Circular Reference Detection

```python
# Safe handling of circular references
circular_data = {"self": None}
circular_data["self"] = circular_data

result = deserialize_fast(circular_data)
# Returns: {"self": {}}  # Circular reference safely broken
```
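Breaking a cycle this way can be sketched with an `id()`-based seen set (illustrative only, not datason's implementation; note it also collapses repeated non-circular references to the same object):

```python
def break_cycles(obj, _seen=None):
    """Return a copy of obj with any circular references replaced by {} or []."""
    if _seen is None:
        _seen = set()
    if isinstance(obj, dict):
        if id(obj) in _seen:
            return {}  # cycle detected: substitute an empty dict
        _seen.add(id(obj))
        return {k: break_cycles(v, _seen) for k, v in obj.items()}
    if isinstance(obj, list):
        if id(obj) in _seen:
            return []
        _seen.add(id(obj))
        return [break_cycles(v, _seen) for v in obj]
    return obj

circular = {"self": None}
circular["self"] = circular
print(break_cycles(circular))  # {'self': {}}
```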
## Performance Optimization

### Ultra-Fast Basic Types

```python
# Zero-overhead processing for common cases
result = deserialize_fast(42)       # Immediate return
result = deserialize_fast("hello")  # < 8 chars = immediate return
result = deserialize_fast(True)     # Immediate return
result = deserialize_fast(None)     # Immediate return
```
### Caching Systems

New in v0.7.0: Configurable caching system with multiple scopes for different workflows.

```python
import datason
from datason import CacheScope

# Choose cache scope based on your needs
datason.set_cache_scope(CacheScope.REQUEST)  # For web APIs
datason.set_cache_scope(CacheScope.PROCESS)  # For ML training

# Pattern caching for repeated strings (automatic)
datetime_str = "2023-01-01T12:00:00"
result1 = datason.deserialize_fast(datetime_str)  # Parse and cache
result2 = datason.deserialize_fast(datetime_str)  # Cache hit! (faster)

# Object caching for parsed results (automatic)
uuid_str = "12345678-1234-5678-9012-123456789abc"
result1 = datason.deserialize_fast(uuid_str)  # Parse and cache UUID object
result2 = datason.deserialize_fast(uuid_str)  # Return cached UUID object
```

See the Caching Documentation for detailed configuration and performance tuning.
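The idea behind string-parse caching can be reproduced with `functools.lru_cache`; this is a stdlib analogy, not datason's cache:

```python
from datetime import datetime
from functools import lru_cache

@lru_cache(maxsize=1024)
def parse_datetime_cached(s: str) -> datetime:
    """Parse an ISO string once; repeated calls return the cached object."""
    return datetime.fromisoformat(s)

a = parse_datetime_cached("2023-01-01T12:00:00")
b = parse_datetime_cached("2023-01-01T12:00:00")
print(a is b)  # True: the second call is a cache hit
print(parse_datetime_cached.cache_info().hits)  # 1
```

Because the cached value is the parsed object itself, repeated strings skip both parsing and object construction, which is what makes DateTime/UUID-heavy workloads in the benchmark table benefit most.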
### Memory Pooling

```python
# Container pooling for reduced allocations
# Automatically managed - no user configuration needed
large_nested = {"data": [{"item": i} for i in range(1000)]}
result = deserialize_fast(large_nested)  # Uses pooled containers internally
```
## Testing & Validation

### Round-Trip Testing

```python
import json

def test_round_trip(original_data):
    """Test perfect round-trip preservation."""
    config = SerializationConfig(include_type_hints=True)

    # Serialize
    serialized = serialize(original_data, config=config)

    # Ensure JSON compatibility
    json_str = json.dumps(serialized)
    parsed = json.loads(json_str)

    # Deserialize
    deserialized = deserialize_fast(parsed, config=config)

    # Validate
    assert type(deserialized) is type(original_data)
    assert deserialized == original_data
```
### Type Validation

```python
import numpy as np
import pandas as pd
import torch

# Comprehensive type support validation
test_cases = [
    (42, int),                             # Basic int
    (np.int32(42), np.int32),              # NumPy int32
    ((1, 2, 3), tuple),                    # Tuple preservation
    (pd.Series([1, 2, 3]), pd.Series),     # Pandas Series
    (torch.tensor([1, 2]), torch.Tensor),  # PyTorch tensor
]

config = SerializationConfig(include_type_hints=True)
for original, expected_type in test_cases:
    serialized = serialize(original, config=config)
    result = deserialize_fast(serialized, config=config)
    assert type(result) is expected_type
```
## Migration Guide

### From v0.5.x to v0.6.0

```python
# Old approach
from datason.deserializers import deserialize
result = deserialize(data)

# New optimized approach (3.73x faster)
from datason.deserializers import deserialize_fast
result = deserialize_fast(data)  # Drop-in replacement

# Enhanced type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)
```
### Performance Migration

```python
# Replace slow patterns
for item in large_dataset:
    result = deserialize(item)  # Slow

# With optimized batch processing
results = [deserialize_fast(item) for item in large_dataset]  # Fast
```
## Best Practices

### Type Preservation Strategy

```python
# For maximum compatibility: include type hints
config = SerializationConfig(include_type_hints=True)

# For maximum performance: rely on auto-detection
# (only for basic types + datetime/UUID)
result = deserialize_fast(data)  # No config needed
```
### Error Handling

```python
import logging

from datason.deserializers import DeserializationSecurityError

logger = logging.getLogger(__name__)

try:
    result = deserialize_fast(untrusted_data, config=config)
except DeserializationSecurityError as e:
    logger.warning(f"Security violation in deserialization: {e}")
    # Handle security issue appropriately
except Exception as e:
    logger.error(f"Deserialization failed: {e}")
    # Handle other errors
```
### Production Monitoring

```python
import warnings

# Monitor security warnings in production
with warnings.catch_warnings(record=True) as w:
    warnings.simplefilter("always")  # ensure warnings are actually recorded
    result = deserialize_fast(data)

for warning in w:
    if "circular reference" in str(warning.message).lower():
        logger.info(f"Circular reference detected and handled: {warning.message}")
```
This comprehensive deserialization system provides both high performance and complete type safety, making it suitable for everything from high-throughput data processing to complex scientific computing workflows.