datason Performance Benchmarks¶
📊 Note: This document contains historical performance measurements for reference. Current performance benchmarking is now handled by the external datason-benchmarks repository, which runs automatically on every PR.
Overview¶
This document contains real performance measurements for datason v0.7.0, obtained through systematic benchmarking rather than estimates. All benchmarks are reproducible with the benchmark scripts, which have since moved to the external repository.
Benchmark Environment¶
- Python: 3.12.0
- Platform: macOS (Darwin 24.5.0)
- Dependencies: NumPy, Pandas, PyTorch
- Method: 5 iterations per test, statistical analysis (mean ± std dev)
- Hardware: Modern development machine (representative performance)
- Version: datason v0.7.0 with performance optimizations and configuration system
🆕 NEW: v0.4.5 Performance Breakthroughs¶
Template-Based Deserialization (NEW in v0.4.5)¶
Revolutionary Performance: Template deserialization provides 24x faster deserialization for structured data with known schemas.
Method | Performance | Speedup | Use Case |
---|---|---|---|
Template Deserializer | 64.0μs ± 6.3μs | 24.4x faster | Known schema, repeated data |
Auto Deserialization | 1,565μs ± 61.1μs | 1.0x (baseline) | Unknown schema, one-off data |
DataFrame Template | 774μs ± 73.2μs | 2.0x faster | Structured tabular data |
Real-world impact:
- Processing 10,000 records: 640ms vs 15.6 seconds (a ~24x reduction in total time)
- API response parsing: sub-millisecond deserialization for structured responses
- ML inference pipelines: negligible deserialization overhead
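For a sense of the workflow, here is a minimal sketch. The `TemplateDeserializer` name follows the table above, but the import path and constructor signature are assumptions; check the deserialization docs for the current API.

from datetime import datetime
import uuid

import datason
from datason.deserializers import TemplateDeserializer  # assumed import path

# Infer the schema once from a representative record...
template = {"id": uuid.uuid4(), "created": datetime.now(), "score": 0.0}
deserializer = TemplateDeserializer(template)

# ...then repeated records skip per-value type detection entirely.
payload = datason.serialize({"id": uuid.uuid4(), "created": datetime.now(), "score": 1.5})
record = deserializer.deserialize(payload)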
Chunked Processing & Streaming (NEW in v0.4.0)¶
Memory-Bounded Processing: Handle datasets larger than available RAM with linear memory usage.
Chunked Serialization Performance¶
Data Type | Standard Memory | Chunked Memory | Memory Reduction |
---|---|---|---|
Large DataFrames | 2.4GB peak | 95MB peak | 95% reduction |
Numpy Arrays | 1.8GB peak | 52MB peak | 97% reduction |
Large Lists | 850MB peak | 48MB peak | 94% reduction |
Streaming Performance¶
Method | Performance | Memory Usage | Use Case |
---|---|---|---|
Streaming to .jsonl | 69μs ± 8.9μs | < 50MB | Large dataset processing |
Streaming to .json | 1,912μs ± 105μs | < 50MB | Compatibility with existing tools |
Batch Processing | 5,560μs ± 248μs | 2GB+ | Traditional approach |
Memory Efficiency: up to 99% memory reduction for large dataset processing.
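A hedged sketch of both APIs, assuming the `serialize_chunked` and `stream_serialize` entry points introduced with this feature; verify the exact signatures against the installed version.

import datason

records = [{"i": i, "value": i * 1.5} for i in range(1_000_000)]

# Chunked: serialize fixed-size pieces so peak memory stays bounded.
result = datason.serialize_chunked(records, chunk_size=10_000)  # assumed signature
for chunk in result.chunks:  # assumed: an iterable of serialized pieces
    ...  # handle one bounded-memory piece at a time

# Streaming: append records to .jsonl without materializing the whole dataset.
with datason.stream_serialize("large_dataset.jsonl") as stream:  # assumed API
    for record in records:
        stream.write(record)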
Custom Serializers Performance Impact¶
Significant Speedup: Custom serializers provide 3.7x performance improvement for known object types.
Approach | Performance | Speedup | Use Case |
---|---|---|---|
Fast Custom Serializer | 1.84ms ± 0.07ms | 3.7x faster | Known object types |
Detailed Custom Serializer | 1.95ms ± 0.03ms | 3.5x faster | Rich serialization |
No Custom Serializer | 6.89ms ± 0.21ms | 1.0x (baseline) | Auto-detection |
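A sketch of wiring in a fast serializer for a known type; the `custom_serializers` mapping on `SerializationConfig` is an assumption here, so consult the configuration docs for the exact hook.

import datason
from datason import SerializationConfig

class Session:
    def __init__(self, user_id: int, token: str):
        self.user_id = user_id
        self.token = token

# A direct dict conversion skips the generic type detection measured as the baseline above.
config = SerializationConfig(
    custom_serializers={Session: lambda s: {"user_id": s.user_id, "token": s.token}}  # assumed option
)
result = datason.serialize([Session(1, "abc"), Session(2, "def")], config=config)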
Results Summary¶
Simple Data Performance¶
Test: 1000 JSON-compatible user objects
data = {
    "users": [
        {"id": i, "name": f"user_{i}", "active": True, "score": i * 1.5}
        for i in range(1000)
    ]
}
Library | Performance | Relative Speed |
---|---|---|
Standard JSON | 0.44ms ± 0.03ms | 1.0x (baseline) |
datason | 0.66ms ± 0.07ms | 1.5x |
Analysis: datason adds only 50% overhead vs standard JSON for compatible data, which is excellent considering the added functionality (type detection, ML object support, safety features, configuration system, chunked processing, template deserialization).
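These figures can be reproduced with a small harness in the spirit of the methodology section below (five timed iterations after a discarded warm-up run, reporting mean ± std dev):

import json
import statistics
import time

import datason

data = {
    "users": [
        {"id": i, "name": f"user_{i}", "active": True, "score": i * 1.5}
        for i in range(1000)
    ]
}

def bench(fn, iterations=5):
    fn()  # warm-up run, discarded
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000)  # milliseconds
    return statistics.mean(times), statistics.stdev(times)

print("json:    %.2fms ± %.2fms" % bench(lambda: json.dumps(data)))
print("datason: %.2fms ± %.2fms" % bench(lambda: datason.serialize(data)))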
Complex Data Performance¶
Test: 500 session objects with UUIDs and datetimes
data = {
    "sessions": [
        {
            "id": uuid.uuid4(),
            "start_time": datetime.now(),
            "user_data": {"preferences": [...], "last_login": datetime.now()}
        }
        for i in range(500)
    ]
}
Library | Performance | Notes |
---|---|---|
datason | 7.34ms ± 0.44ms | Only option for this data |
Pickle | 0.79ms ± 0.06ms | Binary format, Python-only |
Standard JSON | ❌ Fails | Cannot serialize UUIDs/datetime |
Analysis: datason is 9.3x slower than pickle but provides JSON output that's human-readable and cross-platform compatible.
High-Throughput Scenarios¶
Large Nested Data: 100 groups × 50 items (5,000 complex objects)
- Throughput: 131,042 items/second
- Performance: 38.16ms ± 0.40ms total
NumPy Arrays: Multiple arrays with ~23K total elements
- Throughput: 1,630,766 elements/second
- Performance: 14.17ms ± 3.10ms total
Pandas DataFrames: 5K total rows
- Throughput: 867,224 rows/second
- Performance: 5.88ms ± 0.43ms total
Round-Trip Performance¶
Test: Complete workflow (serialize → JSON.dumps → JSON.loads → deserialize)
# Complex data with UUIDs and timestamps
serialize_time = 2.87ms ± 1.68ms
deserialize_time = 1.82ms ± 1.52ms
total_round_trip = 4.40ms ± 1.40ms
Real-world significance: Complete API request-response cycle under 4.5ms.
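The measured cycle corresponds to a workflow like this:

import json
import uuid
from datetime import datetime

import datason

payload = {"id": uuid.uuid4(), "start_time": datetime.now()}

wire = json.dumps(datason.serialize(payload))     # serialize, then encode to JSON text
restored = datason.deserialize(json.loads(wire))  # decode, then restore rich types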
Configuration System Performance Impact¶
🚀 Updated Configuration Presets Comparison (v0.7.0)¶
Advanced Types Performance (Decimals, UUIDs, Complex numbers, Paths, Enums):
Configuration | Performance | Ops/sec | Use Case |
---|---|---|---|
ML Config | 0.99ms ± 0.09ms | 1,005 | ML pipelines, numeric focus |
API Config | 1.01ms ± 0.06ms | 988 | API responses, consistency |
Performance Config | 1.08ms ± 0.01ms | 923 | Speed-critical applications |
Default | 1.11ms ± 0.01ms | 899 | General use |
Strict Config | 13.43ms ± 2.06ms | 74 | Maximum type preservation |
Pandas DataFrame Performance (Large DataFrames with mixed types):
Configuration | Performance | Ops/sec | Best For |
---|---|---|---|
Performance Config | 1.64ms ± 0.15ms | 610 | High-throughput data processing |
ML Config | 4.75ms ± 0.20ms | 211 | ML-specific optimizations |
API Config | 4.72ms ± 0.15ms | 212 | Consistent API responses |
Default | 4.83ms ± 0.08ms | 207 | General use |
Strict Config | 5.92ms ± 2.35ms | 169 | Type safety, debugging |
Key Performance Insights¶
- Performance Config: 2.9x faster for large DataFrames (1.64ms vs 4.83ms default)
- Strict Config: Preserves maximum type information but 13.6x slower for complex types
- Configuration Overhead: Minimal for simple data; optimized presets yield significant gains on complex data
- Custom Serializers: 2.7x faster than auto-detection (2.13ms vs 5.72ms)
Date Format Performance¶
Test: 1000 datetime objects in nested structure
Format | Performance | Best For |
---|---|---|
Unix Timestamp | 3.51ms ± 0.13ms | Compact, fast parsing |
Unix Milliseconds | 3.52ms ± 0.09ms | JavaScript compatibility |
ISO Format | 3.68ms ± 0.14ms | Standards compliance |
String Format | 3.66ms ± 0.11ms | Human readability |
Custom Format | 5.19ms ± 0.09ms | Specific requirements |
NaN Handling Performance¶
Test: 3000 values with mixed NaN/None/Infinity
Strategy | Performance | Trade-off |
---|---|---|
Convert to NULL | 3.29ms ± 0.03ms | JSON compatibility |
Keep Original | 3.41ms ± 0.14ms | Exact representation |
Convert to String | 3.54ms ± 0.04ms | Preserve information |
Drop Values | 3.55ms ± 0.10ms | Clean data |
Type Coercion Impact¶
Test: 700 objects with decimals, UUIDs, complex numbers, paths, enums
Strategy | Performance | Data Fidelity |
---|---|---|
Safe (Default) | 1.79ms ± 0.08ms | Balanced approach |
Aggressive | 1.86ms ± 0.05ms | Simplified types |
Strict | 2.15ms ± 0.01ms | Maximum preservation |
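Selecting a strategy is a one-line configuration change; a sketch, assuming the `TypeCoercion` enum mirrors the strategy names above:

from decimal import Decimal

import datason
from datason import SerializationConfig, TypeCoercion  # enum name is an assumption

config = SerializationConfig(type_coercion=TypeCoercion.AGGRESSIVE)  # trade fidelity for simpler output
result = datason.serialize({"price": Decimal("9.99")}, config=config)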
DataFrame Orientation Performance¶
Small DataFrames (100 rows):
Orientation | Performance | Best For |
---|---|---|
Values | 0.06ms ± 0.00ms | Array-like data |
Dict | 0.13ms ± 0.01ms | Column-oriented |
List | 0.16ms ± 0.01ms | Simple columns |
Split | 0.18ms ± 0.02ms | Structured metadata |
Records | 0.19ms ± 0.00ms | Row-oriented |
Index | 0.20ms ± 0.01ms | Index-focused |
Large DataFrames (5000 rows):
Orientation | Performance | Memory Efficiency |
---|---|---|
Values | 0.94ms ± 0.67ms | Minimal overhead |
List | 1.26ms ± 0.03ms | Column arrays |
Split | 1.69ms ± 0.07ms | Structured format |
Dict | 1.69ms ± 0.05ms | Column mapping |
Index | 2.59ms ± 0.08ms | Index preservation |
Records | 3.90ms ± 2.68ms | Row objects |
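Orientation is selected through the configuration system; a short sketch using the options benchmarked above (the `dataframe_orient` parameter also appears in the optimization guide below):

import pandas as pd

import datason
from datason import SerializationConfig

df = pd.DataFrame({"a": range(5000), "b": [x * 1.5 for x in range(5000)]})

fast = datason.serialize(df, config=SerializationConfig(dataframe_orient="values"))       # fastest
readable = datason.serialize(df, config=SerializationConfig(dataframe_orient="records"))  # row-oriented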
Memory Usage Analysis¶
Serialized Output Sizes for complex mixed-type data:
Configuration | Serialized Size | Use Case |
---|---|---|
Performance Config | 185KB | Optimized output |
NaN Drop Config | 303KB | Clean data |
Strict Config | 388KB | Maximum information |
Analysis: Performance config produces 52% smaller output than strict config while maintaining essential information.
Why We Compare with Pickle¶
Pickle is the natural comparison point because it's the only other tool that can serialize complex Python objects like ML models and DataFrames. However, the 9.3x performance difference tells only part of the story.
🏆 When Pickle Wins¶
# Pure Python environment, speed is everything
import pickle
import time

start = time.time()
with open('model.pkl', 'wb') as f:
    pickle.dump(complex_ml_pipeline, f)  # 0.79ms
print(f"Saved in {(time.time() - start) * 1000:.1f}ms")
# ✅ Fastest option for Python-only workflows
# ✅ Perfect object reconstruction
# ✅ Handles most Python objects (classes, custom types; lambdas need dill)
🌐 When datason Wins¶
# Multi-language team, API responses, data sharing
import json
import time

import datason as ds

start = time.time()
json_data = ds.serialize(complex_ml_pipeline)  # 7.34ms
with open('model.json', 'w') as f:
    json.dump(json_data, f)
print(f"Saved in {(time.time() - start) * 1000:.1f}ms")
# ✅ Frontend team can read it immediately
# ✅ Business stakeholders can inspect results
# ✅ Works in Git diffs, text editors, web browsers
# ✅ API responses work across all platforms
# ✅ Configurable behavior for different use cases
📊 The Real Tradeoff¶
# Performance vs Versatility
pickle_speed = 0.79 # ms
datason_speed = 7.34 # ms
overhead = 6.55 # ms extra
# But with configuration optimization:
datason_performance_config = 1.66 # ms
optimized_overhead = 0.87 # ms extra
# Questions to ask:
# - Is 0.87-6.55ms overhead significant for your use case?
# - Do you need cross-language compatibility?
# - Do you need human-readable output?
# - Are you building APIs or microservices?
# For most modern applications: <7ms is negligible
# For high-frequency trading: Every microsecond matters (use pickle)
# For web APIs: Human-readable JSON is essential (use datason)
Performance Optimization Guide¶
🚀 Speed-Critical Applications¶
from datason import serialize, get_performance_config
# Use optimized configuration
config = get_performance_config()
result = serialize(data, config=config)
# → Up to 2.9x faster for large DataFrames
🎯 Balanced Performance¶
from datason import serialize, get_ml_config
# ML-optimized settings
config = get_ml_config()
result = serialize(ml_data, config=config)
# → Good performance + ML-specific optimizations
🔧 Custom Optimization¶
from datason import SerializationConfig, DateFormat, NanHandling
# Fine-tune for your use case
config = SerializationConfig(
    date_format=DateFormat.UNIX,    # Fastest date format
    nan_handling=NanHandling.NULL,  # Fastest NaN handling
    dataframe_orient="split"        # Best for large DataFrames
)
result = serialize(data, config=config)
Memory Usage Optimization¶
- Performance Config: ~131KB serialized size
- Strict Config: ~149KB serialized size (+13% memory)
- NaN Drop Config: ~135KB serialized size (clean data)
Comparative Analysis¶
vs Standard JSON¶
- Compatibility: datason handles 20+ data types vs JSON's 6 basic types
- Overhead: Only 1.5x for compatible data (vs 3-10x for many JSON alternatives)
- Safety: Graceful handling of NaN/Infinity vs JSON's errors
- Configuration: Tunable behavior vs fixed behavior
vs Pickle¶
- Speed: 9.3x slower but provides human-readable JSON
- Portability: Cross-language compatible vs Python-only
- Security: No arbitrary code execution risks
- Debugging: Human-readable output for troubleshooting
- Flexibility: Configurable serialization behavior
vs Specialized Libraries¶
- orjson/ujson: Faster for basic JSON types but cannot handle ML objects
- joblib: Good for NumPy arrays but binary format
- datason: Best balance of functionality, performance, and compatibility
Configuration Performance Recommendations¶
Use Case → Configuration Mapping¶
Your Situation | Recommended Config | Performance Gain |
---|---|---|
High-throughput data pipelines | `get_performance_config()` | Up to 2.9x faster |
ML model APIs | `get_ml_config()` | Optimized for numeric data |
REST API responses | `get_api_config()` | Consistent, readable output |
Debugging/development | `get_strict_config()` | Maximum type information |
General use | Default (no config) | Balanced approach |
DataFrame Optimization¶
- Small DataFrames (<1K rows): Use `orient="values"` (fastest)
- Large DataFrames (>1K rows): Use `orient="split"` (best scaling)
- Human-readable APIs: Use `orient="records"` (intuitive)
Date/Time Optimization¶
- Performance: Unix timestamps (`DateFormat.UNIX`)
- JavaScript compatibility: Unix milliseconds (`DateFormat.UNIX_MS`)
- Standards compliance: ISO format (`DateFormat.ISO`)
Pickle Bridge Performance¶
New in v0.3.0: datason's Pickle Bridge feature converts legacy ML pickle files to portable JSON format with enterprise-grade security.
Test Environment¶
- Feature: Pickle Bridge v0.3.0 (pickle-to-JSON conversion)
- Test Data: Basic Python objects, NumPy arrays, Pandas DataFrames
- Security: ML-safe class whitelisting (54 default safe classes)
- Method: 5 iterations per test, statistical analysis
Performance Comparison¶
Small Dataset (100 objects):
Approach | Performance | Ops/sec | Security | Use Case |
---|---|---|---|---|
Manual (pickle + datason) | 0.06ms ± 0.00ms | 16,598 | ⭐⭐ | Trusted environments |
dill + JSON | 0.05ms ± 0.00ms | 22,202 | ⭐⭐ | Extended pickle support |
jsonpickle | 0.10ms ± 0.01ms | 10,183 | ⭐⭐⭐ | General Python objects |
Pickle Bridge (datason) | 0.43ms ± 0.05ms | 2,318 | ⭐⭐⭐⭐⭐ | Production ML migration |
Large Dataset (500 objects):
Approach | Performance | Ops/sec | Relative Speed |
---|---|---|---|
Manual (pickle + datason) | 0.25ms ± 0.03ms | 4,037 | 7.5x faster |
dill + JSON | 0.15ms ± 0.00ms | 6,572 | 12.5x faster |
jsonpickle | 0.35ms ± 0.01ms | 2,860 | 5.3x faster |
Pickle Bridge (datason) | 1.87ms ± 0.08ms | 535 | 1.0x (baseline) |
Security Overhead Analysis¶
Security vs Performance Trade-off:
Mode | Performance (100 obj) | Performance (500 obj) | Security Level |
---|---|---|---|
Safe (recommended) | 0.40ms ± 0.01ms | 1.96ms ± 0.07ms | Enterprise-grade |
Unsafe (comparison) | 0.41ms ± 0.01ms | 1.84ms ± 0.05ms | No protection |
Overhead | ~2-6% slower | ~6% slower | Worth the security |
Key Finding: Enterprise security adds only 2-6% overhead - excellent trade-off for production use.
File Size Analysis¶
Pickle vs JSON Size Comparison:
Data Type | Dataset Size | Pickle Size | JSON Size | Size Ratio |
---|---|---|---|---|
Basic Objects | 100 items | 3.0 KB | 5.4 KB | 1.79x larger |
Basic Objects | 500 items | 15.0 KB | 20.0 KB | 1.34x larger |
NumPy Arrays | 100 items | 1.6 KB | 4.5 KB | 2.77x larger |
NumPy Arrays | 500 items | 6.2 KB | 14.4 KB | 2.33x larger |
Pandas DataFrames | 100 items | 2.0 KB | 5.4 KB | 2.69x larger |
Analysis:
- Basic objects scale well (1.34x overhead for larger datasets)
- NumPy/Pandas have higher overhead due to text vs binary representation
- Trade-off: 1.3-2.8x larger files for cross-platform compatibility
Bulk Operations Performance¶
Directory Conversion Benchmarks:
Operation | Performance | Ops/sec | Best For |
---|---|---|---|
Bulk conversion | 6.17ms ± 0.28ms | 162 | Multiple files at once |
Individual files | 3.89ms ± 0.12ms | 257 | Single file processing |
Recommendation: Use individual file conversion for better throughput, bulk conversion for convenience.
Real-World Performance Scenarios¶
🚀 High-Performance ML Pipeline¶
# Manual approach: Maximum speed, trusted environment
import pickle

import datason

with open('model.pkl', 'rb') as f:
    data = pickle.load(f)  # Trust your own files
result = datason.serialize(data)  # 0.06ms - 16,598 ops/sec
🛡️ Production ML Migration¶
# Pickle Bridge: Security + performance balance
from datason import PickleBridge  # import path may vary by version

bridge = PickleBridge()  # Uses the ML-safe class whitelist by default
result = bridge.from_pickle_file('model.pkl')  # 0.43ms - 2,318 ops/sec
# ✅ Prevents arbitrary code execution
# ✅ Handles 95% of ML pickle files
# ✅ Only 7.5x slower than manual approach
🌐 Cross-Platform Data Exchange¶
# jsonpickle: Good middle ground
with open('data.pkl', 'rb') as f:
data = pickle.load(f)
result = jsonpickle.encode(data) # 0.10ms - 10,183 ops/sec
# ✅ 4.3x faster than Pickle Bridge
# ⚠️ Less security validation
Performance vs Security Matrix¶
Priority | Recommended Approach | Speed | Security | Compatibility |
---|---|---|---|---|
Maximum Speed | Manual (pickle + datason) | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
Balanced | jsonpickle | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
Production Security | Pickle Bridge | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Extended Pickle | dill + JSON | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
When to Use Pickle Bridge¶
✅ Perfect For¶
- ML model deployment: Secure pickle file processing
- Data migration projects: Legacy ML pipeline modernization
- Enterprise environments: Security-first approach required
- Cross-platform APIs: Need JSON output from pickle files
- Compliance requirements: Prevent arbitrary code execution
⚠️ Consider Alternatives When¶
- Maximum speed required: Use manual approach (7.5x faster)
- Simple Python objects: jsonpickle may be sufficient
- Trusted environment only: Direct pickle + datason conversion
- Extended pickle features: dill might be better option
Integration with Existing Benchmarks¶
The Pickle Bridge complements datason's existing performance profile:
- Simple data serialization: 0.66ms (1.5x JSON overhead)
- Complex data serialization: 7.34ms (datetime/UUID objects)
- Pickle Bridge conversion: 0.43-1.87ms (varies by data size)
- Round-trip performance: 4.40ms (serialize + deserialize)
Context: Pickle Bridge adds another tool to datason's ecosystem, specifically targeting the ML migration use case with strong security guarantees.
Performance Optimization Tips¶
- Use individual file processing for better throughput (257 vs 162 ops/sec)
- Prefer bytes mode when loading pickle data from memory (see the sketch after this list)
- Monitor complex objects - some may require manual approach
- Batch similar objects for better cache utilization
- Consider manual approach for trusted, speed-critical scenarios
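For the bytes-mode tip, a hedged sketch; the `from_pickle_bytes` method is assumed from the Pickle Bridge API and worth verifying against the current docs.

from datason import PickleBridge  # import path may vary by version

bridge = PickleBridge()
with open("model.pkl", "rb") as f:
    raw = f.read()  # e.g. bytes already in memory from a network fetch or cache
result = bridge.from_pickle_bytes(raw)  # assumed method; skips a second file-system pass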
Benchmark Reproducibility¶
# Benchmark scripts are now in the external datason-benchmarks repository
# Performance testing runs automatically on PRs via GitHub Actions
# For manual benchmarking, see: https://github.com/danielendler/datason-benchmarks
Methodology¶
Benchmark Scripts¶
All measurements came from two complementary benchmark suites (now in external repository):
- `benchmark_real_performance.py`: Core performance baselines
- `enhanced_benchmark_suite.py`: Configuration system impact analysis

Both scripts:
- Multiple Iterations: Run each test 5 times for statistical reliability
- Warm-up: First measurement discarded (JIT compilation, cache loading)
- Statistical Analysis: Report mean, standard deviation, operations per second
- Real Data: Use realistic data structures, not toy examples
- Fair Comparison: Compare like-for-like where possible
Test Data Characteristics¶
- Simple data: JSON-compatible objects only
- Complex data: UUIDs, datetimes, nested structures
- Large data: Thousands of objects with realistic size
- ML data: NumPy arrays, Pandas DataFrames of representative sizes
- Advanced types: Decimals, complex numbers, paths, enums
Measurement Precision¶
- Uses `time.perf_counter()` for high-resolution timing
- Measures end-to-end, including all overhead
- No artificial optimizations or cherry-picked scenarios
When datason Excels¶
- Mixed data types: Standard + ML objects in one structure
- API responses: Need JSON compatibility with complex data
- Data science workflows: Frequent DataFrame/NumPy serialization
- Cross-platform: Human-readable output required
- Configurable behavior: Different performance requirements per use case
- ML migration projects: Secure pickle-to-JSON conversion (NEW in v0.3.0)
- Enterprise security: Prevent arbitrary code execution from pickle files
Performance Tips¶
- Choose the right configuration: 2.9x performance difference between configs
- Use custom serializers: 2.7x faster for known object types
- Optimize date formats: Unix timestamps are fastest
- Batch operations: Group objects for better throughput
- Profile your use case: Run benchmarks with your actual data
When to Consider Alternatives¶
- Pure speed + basic types: Use orjson/ujson
- Python-only + complex objects: Use pickle (~9x faster for complex data)
- Scientific arrays + compression: Use joblib
- Maximum compatibility: Use standard json with manual type handling
Benchmark History¶
Date | Version | Change | Performance Impact |
---|---|---|---|
2025-06-06 | 0.5.0 | Performance breakthrough & security fixes | 1.6M+ elements/sec, critical security hardening |
2025-06-02 | 0.4.5 | Template deserialization & chunked processing | 24x deserialization speedup |
2025-06-01 | 0.2.0 | Configuration system added | Up to 2.9x speedup possible with optimization |
2025-05-30 | 0.3.0 | Pickle Bridge feature added | New: ML pickle-to-JSON conversion (2,318 ops/sec) |
Cache Scope Benchmarks (NEW in v0.7.0)¶
Demonstrates the performance impact of the configurable caching system. Results for 1000 repeated UUID strings on a typical laptop:
Cache Scope | Time (ms) |
---|---|
DISABLED | ~2.8 |
OPERATION | ~2.3 |
REQUEST | ~1.9 |
PROCESS | ~1.3 |
Run with:
# Cache scope benchmarks are now in the external datason-benchmarks repository
# See: https://github.com/danielendler/datason-benchmarks
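For context, selecting a scope looks roughly like this; the `set_cache_scope`/`CacheScope` names follow the v0.7.0 caching feature but are assumptions here:

import uuid

import datason
from datason import CacheScope, set_cache_scope  # assumed names

set_cache_scope(CacheScope.PROCESS)  # widest reuse: the fastest row in the table above
payload = datason.serialize([uuid.uuid4() for _ in range(3)] * 100)  # repeated UUID strings
restored = datason.deserialize(payload)  # repeated parses hit the cache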
Running Benchmarks¶
# Performance benchmarking is now handled by external datason-benchmarks repository
# Runs automatically on every PR via .github/workflows/pr-performance-check.yml
# For manual benchmarking:
# 1. Visit: https://github.com/danielendler/datason-benchmarks
# 2. Clone the repository
# 3. Follow the setup instructions
# Run local performance tests (minimal)
python -m pytest tests/performance/ -v
Interpreting Results¶
Statistical Significance¶
- Mean: Primary performance metric
- Standard deviation: Consistency indicator (lower = more consistent)
- Operations per second: Throughput measurement
Real-World Context¶
- Sub-millisecond: Excellent for interactive applications
- Single-digit milliseconds: Good for API responses
- Double-digit milliseconds: Acceptable for batch processing
- 100ms+: May need optimization for real-time use
Configuration Impact¶
- Performance Config: Choose when speed is critical
- Strict Config: Use for debugging, accept slower performance
- Default: Good balance for most applications
Last updated: June 6, 2025. Benchmarks reflect datason v0.7.0 with performance optimizations and security hardening.