Performance Optimization¶
Optimizations for speed and memory efficiency in production environments, with benchmarking tools and configuration strategies.
🎯 Overview¶
datason provides multiple optimization strategies:
- Early Detection: Skip processing for JSON-compatible data
- Memory Streaming: Handle large datasets without full memory loading
- Configurable Limits: Prevent resource exhaustion attacks
- Built-in Benchmarking: Performance measurement tools
- Configuration Presets: Optimized settings for different use cases
⚡ Performance Configurations¶
Quick Performance Gains¶
```python
import datason

# Use performance-optimized configuration with caching
config = datason.get_performance_config()
datason.set_cache_scope(datason.CacheScope.PROCESS)  # Maximum cache performance

result = datason.serialize(large_dataset, config=config)

# Features:
# - Unix timestamps (fastest date format)
# - Split orientation for large DataFrames
# - Aggressive type coercion
# - Minimal metadata preservation
# - Process-scoped caching for maximum speed
```
Configuration Performance Comparison¶
Based on real benchmarking results:
Configuration | Performance (Complex Data) | Use Case |
---|---|---|
Performance Config | 0.54ms ± 0.03ms | Speed-critical applications |
ML Config | 0.56ms ± 0.08ms | ML pipelines |
Default | 0.58ms ± 0.01ms | General use |
API Config | 0.59ms ± 0.08ms | API responses |
Strict Config | 14.04ms ± 1.67ms | Maximum type preservation |
Performance Config is up to 25x faster than Strict Config!
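To see how the presets compare on your own data, here is a minimal timing sketch (illustrative only; the `mean_ms` helper and sample payload are not part of datason, and results vary by machine):

```python
import time

import datason
from datason import get_api_config, get_performance_config

def mean_ms(payload, config, iterations=20):
    """Average serialization time in milliseconds for one configuration."""
    datason.serialize(payload, config=config)  # warm-up
    start = time.perf_counter()
    for _ in range(iterations):
        datason.serialize(payload, config=config)
    return (time.perf_counter() - start) / iterations * 1000

payload = {"ids": list(range(1000)), "scores": [i * 1.5 for i in range(1000)]}
for name, config in [("performance", get_performance_config()), ("api", get_api_config())]:
    print(f"{name}: {mean_ms(payload, config):.2f}ms")
```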
📊 Real-World Benchmarks¶
Simple Data Performance¶
Test: 1000 JSON-compatible user objects
```python
data = {
    "users": [
        {"id": i, "name": f"user_{i}", "active": True, "score": i * 1.5}
        for i in range(1000)
    ]
}
```
Library | Performance | Relative Speed |
---|---|---|
Standard JSON | 0.40ms ± 0.03ms | 1.0x (baseline) |
datason | 0.62ms ± 0.02ms | 1.53x overhead |
Analysis: Only 53% overhead vs standard JSON for compatible data.
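To reproduce the overhead measurement on your own hardware, a rough sketch using the `data` dict above (note that `datason.serialize()` returns JSON-compatible Python structures rather than a string, so this mirrors the benchmark rather than being a byte-for-byte comparison; the `mean_ms` helper is illustrative):

```python
import json
import time

import datason

def mean_ms(fn, payload, iterations=20):
    """Average wall-clock time of fn(payload) in milliseconds."""
    fn(payload)  # warm-up
    start = time.perf_counter()
    for _ in range(iterations):
        fn(payload)
    return (time.perf_counter() - start) / iterations * 1000

json_ms = mean_ms(json.dumps, data)            # stdlib baseline
datason_ms = mean_ms(datason.serialize, data)  # datason on the same payload
print(f"json: {json_ms:.2f}ms  datason: {datason_ms:.2f}ms  ({datason_ms / json_ms:.2f}x)")
```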
Complex Data Performance¶
Test: 500 session objects with UUIDs and datetimes
```python
import uuid
from datetime import datetime

data = {
    "sessions": [
        {
            "id": uuid.uuid4(),
            "start_time": datetime.now(),
            "user_data": {"preferences": [...], "last_login": datetime.now()}
        }
        for i in range(500)
    ]
}
```
Library | Performance | Notes |
---|---|---|
datason | 7.04ms ± 0.21ms | Only JSON-compatible option for this data |
Pickle | 0.98ms ± 0.50ms | Binary format, Python-only |
Standard JSON | ❌ Fails | Cannot serialize UUIDs/datetime |
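For reference, a small sketch showing the failure mode the table refers to: the standard library raises `TypeError` on UUIDs and datetimes, while datason converts them to JSON-compatible values (the `record` dict is illustrative):

```python
import json
import uuid
from datetime import datetime

import datason

record = {"id": uuid.uuid4(), "start_time": datetime.now()}

try:
    json.dumps(record)
except TypeError as exc:
    print(f"standard json failed: {exc}")

print(datason.serialize(record))  # UUID and datetime serialized without error
```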
High-Throughput Scenarios¶
Large Nested Data: 100 groups × 50 items (5,000 complex objects)
- Throughput: 44,624 items/second
- Performance: 112.05ms ± 0.37ms total
NumPy Arrays: Multiple arrays with ~23K total elements
- Throughput: 908,188 elements/second
- Performance: 25.44ms ± 2.56ms total
Pandas DataFrames: 5K total rows
- Throughput: 1,082,439 rows/second
- Performance: 4.71ms ± 0.75ms total
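The exact figures above are hardware-dependent. A rough sketch for measuring DataFrame throughput locally (the 5K-row frame is an assumption mirroring the scenario above):

```python
import time

import numpy as np
import pandas as pd

import datason
from datason import get_performance_config

df = pd.DataFrame(np.random.randn(5000, 10))  # ~5K rows, as in the scenario above
config = get_performance_config()

datason.serialize(df, config=config)  # warm-up
start = time.perf_counter()
datason.serialize(df, config=config)
elapsed = time.perf_counter() - start
print(f"{len(df) / elapsed:,.0f} rows/second ({elapsed * 1000:.2f}ms total)")
```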
🔧 Optimization Strategies¶
1. Choose the Right Configuration and Cache Scope¶
```python
import datason
from datason import get_performance_config, get_ml_config, get_api_config, CacheScope

# Speed-critical applications
config = get_performance_config()
datason.set_cache_scope(CacheScope.PROCESS)
# Features: Unix timestamps, aggressive coercion, minimal metadata, maximum caching

# ML/AI workflows
config = get_ml_config()
datason.set_cache_scope(CacheScope.PROCESS)
# Features: Optimized for numeric data, preserved precision, persistent caching

# API responses
config = get_api_config()
datason.set_cache_scope(CacheScope.REQUEST)
# Features: Consistent output, human-readable dates, request-scoped caching
```
2. DataFrame Orientation Optimization¶
```python
from datason import SerializationConfig, DataFrameOrient

# Small DataFrames (<1K rows) - use values
config = SerializationConfig(dataframe_orient=DataFrameOrient.VALUES)
# Performance: 0.07ms (fastest for small data)

# Large DataFrames (>1K rows) - use split
config = SerializationConfig(dataframe_orient=DataFrameOrient.SPLIT)
# Performance: 1.63ms (scales best for large data)

# API responses - use records
config = SerializationConfig(dataframe_orient=DataFrameOrient.RECORDS)
# Performance: Intuitive structure, moderate speed
```
3. Date Format Optimization¶
```python
from datason import SerializationConfig, DateFormat

# Fastest: Unix timestamps
config = SerializationConfig(date_format=DateFormat.UNIX)
# Performance: 3.11ms ± 0.06ms

# JavaScript compatibility: Unix milliseconds
config = SerializationConfig(date_format=DateFormat.UNIX_MS)
# Performance: 3.26ms ± 0.05ms

# Human readable: ISO format
config = SerializationConfig(date_format=DateFormat.ISO)
# Performance: 3.46ms ± 0.17ms
```
4. NaN Handling Optimization¶
```python
from datason import SerializationConfig, NanHandling

# Fastest: Convert to NULL
config = SerializationConfig(nan_handling=NanHandling.NULL)
# Performance: 2.83ms ± 0.09ms

# Preserve information: Convert to string
config = SerializationConfig(nan_handling=NanHandling.STRING)
# Performance: 2.89ms ± 0.08ms

# Clean data: Drop values
config = SerializationConfig(nan_handling=NanHandling.DROP)
# Performance: 3.00ms ± 0.08ms
```
5. Caching Performance Optimization¶
New in v0.7.0: Configurable caching system provides 50-200% performance improvements.
```python
import datason
from datason import CacheScope, get_cache_metrics

# Choose cache scope based on use case
datason.set_cache_scope(CacheScope.PROCESS)    # 150-200% performance (ML training)
datason.set_cache_scope(CacheScope.REQUEST)    # 130-150% performance (web APIs)
datason.set_cache_scope(CacheScope.OPERATION)  # 110-120% performance (testing/default)
datason.set_cache_scope(CacheScope.DISABLED)   # Baseline performance (debugging)

# Monitor cache effectiveness
result = datason.deserialize_fast(data_with_datetimes_and_uuids)
metrics = get_cache_metrics()
print(f"Cache hit rate: {metrics[CacheScope.PROCESS].hit_rate:.1%}")
```
Cache Performance by Scope:
Cache Scope | Performance Gain | Best For | Memory Usage |
---|---|---|---|
Process | 150-200% | ML training, analytics | Higher (persistent) |
Request | 130-150% | Web APIs, batch processing | Medium (request-local) |
Operation | 110-120% | Testing, default behavior | Low (operation-local) |
Disabled | Baseline | Debugging, profiling | Minimal (no cache) |
Caching is most effective with:
- Repeated datetime strings (ISO format, Unix timestamps)
- Repeated UUID strings
- Large datasets with duplicate patterns
- Deserialization-heavy workloads
See the Caching Documentation for detailed configuration and usage patterns.
🚀 Memory Optimization¶
Large Dataset Handling¶
```python
# For very large datasets, use chunking
import datason

def serialize_large_dataset(data, chunk_size=1000):
    """Process large datasets in chunks to avoid memory issues."""
    if isinstance(data, list) and len(data) > chunk_size:
        chunks = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            chunks.append(datason.serialize(chunk))
        return {"chunks": chunks, "total_size": len(data)}
    return datason.serialize(data)
```
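A quick usage example for the helper above (values are illustrative): 5,000 records split into five chunks of 1,000.

```python
records = [{"id": i, "value": i * 1.5} for i in range(5000)]
result = serialize_large_dataset(records)
print(result["total_size"], len(result["chunks"]))  # 5000 5
```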
Memory-Efficient Configuration¶
```python
from datason import SerializationConfig, DataFrameOrient, DateFormat

# Minimize memory usage
config = SerializationConfig(
    # Reduce precision when acceptable
    preserve_float_precision=False,
    # Skip unnecessary metadata
    include_type_info=False,
    include_shape_info=False,
    # Use efficient representations
    dataframe_orient=DataFrameOrient.VALUES,
    date_format=DateFormat.UNIX
)
```
Streaming for Large Files¶
```python
import json

import datason

def stream_serialize_to_file(data, filename, chunk_size=1000):
    """Stream serialization to file for memory efficiency."""
    with open(filename, 'w') as f:
        if isinstance(data, (list, tuple)) and len(data) > chunk_size:
            f.write('[')
            for i, item in enumerate(data):
                if i > 0:
                    f.write(',')
                serialized_item = datason.serialize(item)
                json.dump(serialized_item, f)
            f.write(']')
        else:
            result = datason.serialize(data)
            json.dump(result, f)
```
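Example usage (the file name is arbitrary): the list is written item by item, so only one serialized element is held in memory at a time.

```python
from datetime import datetime

big_list = [{"id": i, "created": datetime.now()} for i in range(10_000)]
stream_serialize_to_file(big_list, "export.json")

with open("export.json") as f:
    print(f.read(80))  # peek at the start of the streamed JSON array
```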
📈 Monitoring Performance¶
Built-in Benchmarking¶
```python
import time

import datason

# Measure serialization performance
def benchmark_serialization(data, iterations=5):
    """Benchmark datason serialization performance."""
    times = []

    # Warm-up
    datason.serialize(data)

    # Benchmark
    for _ in range(iterations):
        start = time.perf_counter()
        result = datason.serialize(data)
        end = time.perf_counter()
        times.append(end - start)

    avg_time = sum(times) / len(times)
    ops_per_sec = 1.0 / avg_time if avg_time > 0 else 0

    return {
        'average_time_ms': avg_time * 1000,
        'operations_per_second': ops_per_sec,
        'min_time_ms': min(times) * 1000,
        'max_time_ms': max(times) * 1000
    }

# Example usage
data = {"users": [{"id": i} for i in range(1000)]}
stats = benchmark_serialization(data)
print(f"Average: {stats['average_time_ms']:.2f}ms")
print(f"Throughput: {stats['operations_per_second']:.0f} ops/sec")
```
Performance Profiling¶
```python
import cProfile
from datetime import datetime

import numpy as np
import pandas as pd

import datason

def profile_serialization(data):
    """Profile datason serialization to identify bottlenecks."""
    profiler = cProfile.Profile()
    profiler.enable()

    result = datason.serialize(data)

    profiler.disable()
    profiler.print_stats(sort='cumulative')
    return result

# Example: Profile complex data serialization
complex_data = {
    'dataframes': [pd.DataFrame(np.random.randn(100, 10)) for _ in range(5)],
    'arrays': [np.random.randn(1000) for _ in range(10)],
    'timestamps': [datetime.now() for _ in range(100)]
}

profile_serialization(complex_data)
```
🔍 Performance Debugging¶
Identify Slow Objects¶
```python
import time

import numpy as np
import pandas as pd

import datason

def find_slow_objects(data_dict, threshold_ms=10):
    """Find objects that take longer than threshold to serialize."""
    slow_objects = []

    for key, value in data_dict.items():
        start = time.perf_counter()
        try:
            datason.serialize(value)
            duration_ms = (time.perf_counter() - start) * 1000
            if duration_ms > threshold_ms:
                slow_objects.append({
                    'key': key,
                    'duration_ms': duration_ms,
                    'type': type(value).__name__,
                    'size': len(str(value)) if hasattr(value, '__len__') else 'unknown'
                })
        except Exception as e:
            slow_objects.append({
                'key': key,
                'duration_ms': float('inf'),
                'type': type(value).__name__,
                'error': str(e)
            })

    return sorted(slow_objects, key=lambda x: x.get('duration_ms', 0), reverse=True)

# Example usage
data = {
    'small_list': list(range(100)),
    'large_dataframe': pd.DataFrame(np.random.randn(10000, 50)),
    'complex_dict': {f'key_{i}': {'nested': list(range(100))} for i in range(100)}
}

slow_items = find_slow_objects(data)
for item in slow_items:
    print(f"{item['key']}: {item['duration_ms']:.2f}ms ({item['type']})")
```
Memory Usage Monitoring¶
```python
import os

import numpy as np
import psutil

import datason

def monitor_memory_usage(data):
    """Monitor memory usage during serialization."""
    process = psutil.Process(os.getpid())

    # Baseline memory
    baseline_mb = process.memory_info().rss / 1024 / 1024

    # Serialize and measure
    result = datason.serialize(data)
    peak_mb = process.memory_info().rss / 1024 / 1024

    return {
        'baseline_memory_mb': baseline_mb,
        'peak_memory_mb': peak_mb,
        'memory_increase_mb': peak_mb - baseline_mb,
        'serialized_size_chars': len(str(result))
    }

# Example
large_data = {'arrays': [np.random.randn(10000) for _ in range(10)]}
memory_stats = monitor_memory_usage(large_data)
print(f"Memory increase: {memory_stats['memory_increase_mb']:.1f} MB")
```
🎯 Production Optimizations¶
API Response Optimization¶
```python
# Optimized for API responses
import time

import datason

def serialize_api_response(data):
    """Optimized serialization for API responses."""
    config = datason.get_api_config()

    # Add response metadata efficiently
    response_data = {
        'data': data,
        'timestamp': time.time(),  # Unix timestamp for speed
        'version': '1.0'
    }

    return datason.serialize(response_data, config=config)
```
Batch Processing Optimization¶
```python
# Optimized for batch processing
import datason

def serialize_batch(data_list, batch_size=100):
    """Process data in optimized batches."""
    config = datason.get_performance_config()

    results = []
    for i in range(0, len(data_list), batch_size):
        batch = data_list[i:i + batch_size]
        batch_result = datason.serialize(batch, config=config)
        results.append(batch_result)

    return results
```
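For example (values are illustrative), 1,000 events with the default batch size produce ten serialized batches:

```python
events = [{"event_id": i, "ok": True} for i in range(1000)]
batches = serialize_batch(events)
print(len(batches))  # 10
```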
Database Export Optimization¶
```python
# Optimized for database export
import datason
from datason import SerializationConfig, DataFrameOrient, DateFormat, NanHandling

def serialize_for_database(df):
    """Optimized DataFrame serialization for database export."""
    config = SerializationConfig(
        dataframe_orient=DataFrameOrient.SPLIT,  # Most efficient
        date_format=DateFormat.UNIX,             # Numeric timestamps
        nan_handling=NanHandling.NULL,           # Database-compatible nulls
        preserve_numeric_precision=True          # Maintain data integrity
    )
    return datason.serialize(df, config=config)
```
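A usage sketch (column names are illustrative): NaNs come out as nulls and timestamps as Unix numbers, which most databases ingest directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "score": [0.5, np.nan, 0.9],
    "created": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
payload = serialize_for_database(df)
```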
🏆 Performance Best Practices¶
1. Profile Before Optimizing¶
```python
# Always measure first
stats = benchmark_serialization(your_data)
print(f"Baseline: {stats['average_time_ms']:.2f}ms")
```
2. Choose Configuration Wisely¶
```python
# For speed-critical applications
config = datason.get_performance_config()

# For human-readable output
config = datason.get_api_config()

# For ML workflows
config = datason.get_ml_config()
```
3. Optimize Data Structure¶
```python
# Use efficient pandas dtypes
df = df.astype({
    'category_col': 'category',
    'int_col': 'int32',
    'float_col': 'float32'
})

# Pre-convert problematic types (example: complex numbers to strings)
data = {k: str(v) if isinstance(v, complex) else v
        for k, v in data.items()}
```
4. Monitor in Production¶
```python
# Add timing to production code
start = time.perf_counter()
result = datason.serialize(data)
duration = time.perf_counter() - start

if duration > 0.1:  # Log slow serializations
    logger.warning(f"Slow serialization: {duration:.3f}s for {type(data)}")
```
🔗 Related Features¶
- Configuration System - Performance-tuned configurations
- Benchmarking - Detailed performance analysis
- Core Serialization - Understanding the base layer
- Pandas Integration - DataFrame-specific optimizations
🚀 Next Steps¶
- Configuration → - Fine-tune performance settings
- Benchmarking → - Run your own performance tests
- Production Deployment → - Production best practices
📊 Performance Analysis¶
- Benchmarking - Detailed performance analysis
- Core Strategy - Algorithmic optimization strategies
- Production Performance - Production environment tips
- CI Performance - Continuous integration optimization