datason Performance Benchmarks¶
📊 Note: This document contains historical performance measurements for reference. Current performance benchmarking is now handled by the external datason-benchmarks repository, which runs automatically on every PR.
Overview¶
This document contains real performance measurements for datason v0.7.0, obtained through systematic benchmarking rather than estimates. All benchmarks are reproducible with the benchmark scripts, which have since moved to the external repository.
Benchmark Environment¶
- Python: 3.12.0
- Platform: macOS (Darwin 24.5.0)
- Dependencies: NumPy, Pandas, PyTorch
- Method: 5 iterations per test, statistical analysis (mean ± std dev)
- Hardware: Modern development machine (representative performance)
- Version: datason v0.7.0 with performance optimizations and configuration system
🆕 NEW: v0.4.5 Performance Breakthroughs¶
Template-Based Deserialization (NEW in v0.4.5)¶
Revolutionary Performance: Template deserialization provides 24x faster deserialization for structured data with known schemas.
Method | Performance | Speedup | Use Case |
---|---|---|---|
Template Deserializer | 64.0μs ± 6.3μs | 24.4x faster | Known schema, repeated data |
Auto Deserialization | 1,565μs ± 61.1μs | 1.0x (baseline) | Unknown schema, one-off data |
DataFrame Template | 774μs ± 73.2μs | 2.0x faster | Structured tabular data |
Real-world impact:
- Processing 10,000 records: 640ms vs 15.6 seconds (a ~24x reduction in total time)
- API response parsing: sub-millisecond deserialization for structured responses
- ML inference pipelines: negligible deserialization overhead
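For a sense of the workflow, here is a minimal sketch. The `TemplateDeserializer` name follows the table above, but the import path and constructor signature are assumptions; check the deserialization docs for the current API.

from datetime import datetime
import uuid

import datason
from datason.deserializers import TemplateDeserializer  # assumed import path

# Infer the schema once from a representative record...
template = {"id": uuid.uuid4(), "created": datetime.now(), "score": 0.0}
deserializer = TemplateDeserializer(template)

# ...then repeated records skip per-value type detection entirely.
payload = datason.serialize({"id": uuid.uuid4(), "created": datetime.now(), "score": 1.5})
record = deserializer.deserialize(payload)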
Chunked Processing & Streaming (NEW in v0.4.0)¶
Memory-Bounded Processing: Handle datasets larger than available RAM with linear memory usage.
Chunked Serialization Performance¶
Data Type | Standard Memory | Chunked Memory | Memory Reduction |
---|---|---|---|
Large DataFrames | 2.4GB peak | 95MB peak | 95% reduction |
Numpy Arrays | 1.8GB peak | 52MB peak | 97% reduction |
Large Lists | 850MB peak | 48MB peak | 94% reduction |
Streaming Performance¶
Method | Performance | Memory Usage | Use Case |
---|---|---|---|
Streaming to .jsonl | 69μs ± 8.9μs | < 50MB | Large dataset processing |
Streaming to .json | 1,912μs ± 105μs | < 50MB | Compatibility with existing tools |
Batch Processing | 5,560μs ± 248μs | 2GB+ | Traditional approach |
Memory Efficiency: up to 99% memory reduction for large dataset processing.
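A hedged sketch of both APIs, assuming the `serialize_chunked` and `stream_serialize` entry points introduced with this feature; verify the exact signatures against the installed version.

import datason

records = [{"i": i, "value": i * 1.5} for i in range(1_000_000)]

# Chunked: serialize fixed-size pieces so peak memory stays bounded.
result = datason.serialize_chunked(records, chunk_size=10_000)  # assumed signature
for chunk in result.chunks:  # assumed: an iterable of serialized pieces
    ...  # handle one bounded-memory piece at a time

# Streaming: append records to .jsonl without materializing the whole dataset.
with datason.stream_serialize("large_dataset.jsonl") as stream:  # assumed API
    for record in records:
        stream.write(record)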
Custom Serializers Performance Impact¶
Significant Speedup: Custom serializers provide 3.7x performance improvement for known object types.
Approach | Performance | Speedup | Use Case |
---|---|---|---|
Fast Custom Serializer | 1.84ms ± 0.07ms | 3.7x faster | Known object types |
Detailed Custom Serializer | 1.95ms ± 0.03ms | 3.5x faster | Rich serialization |
No Custom Serializer | 6.89ms ± 0.21ms | 1.0x (baseline) | Auto-detection |
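A sketch of wiring in a fast serializer for a known type; the `custom_serializers` mapping on `SerializationConfig` is an assumption here, so consult the configuration docs for the exact hook.

import datason
from datason import SerializationConfig

class Session:
    def __init__(self, user_id: int, token: str):
        self.user_id = user_id
        self.token = token

# A direct dict conversion skips the generic type detection measured as the baseline above.
config = SerializationConfig(
    custom_serializers={Session: lambda s: {"user_id": s.user_id, "token": s.token}}  # assumed option
)
result = datason.serialize([Session(1, "abc"), Session(2, "def")], config=config)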
Results Summary¶
Simple Data Performance¶
Test: 1000 JSON-compatible user objects
data = {
    "users": [
        {"id": i, "name": f"user_{i}", "active": True, "score": i * 1.5}
        for i in range(1000)
    ]
}
Library | Performance | Relative Speed |
---|---|---|
Standard JSON | 0.44ms ± 0.03ms | 1.0x (baseline) |
datason | 0.66ms ± 0.07ms | 1.5x |
Analysis: datason adds only 50% overhead vs standard JSON for compatible data, which is excellent considering the added functionality (type detection, ML object support, safety features, configuration system, chunked processing, template deserialization).
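These figures can be reproduced with a small harness in the spirit of the methodology section below (five timed iterations after a discarded warm-up run, reporting mean ± std dev):

import json
import statistics
import time

import datason

data = {
    "users": [
        {"id": i, "name": f"user_{i}", "active": True, "score": i * 1.5}
        for i in range(1000)
    ]
}

def bench(fn, iterations=5):
    fn()  # warm-up run, discarded
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000)  # milliseconds
    return statistics.mean(times), statistics.stdev(times)

print("json:    %.2fms ± %.2fms" % bench(lambda: json.dumps(data)))
print("datason: %.2fms ± %.2fms" % bench(lambda: datason.serialize(data)))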
Complex Data Performance¶
Test: 500 session objects with UUIDs and datetimes
data = {
    "sessions": [
        {
            "id": uuid.uuid4(),
            "start_time": datetime.now(),
            "user_data": {"preferences": [...], "last_login": datetime.now()}
        }
        for i in range(500)
    ]
}
Library | Performance | Notes |
---|---|---|
datason | 7.34ms ± 0.44ms | Only option for this data |
Pickle | 0.79ms ± 0.06ms | Binary format, Python-only |
Standard JSON | ❌ Fails | Cannot serialize UUIDs/datetime |
Analysis: datason is 9.3x slower than pickle but provides JSON output that's human-readable and cross-platform compatible.
High-Throughput Scenarios¶
Large Nested Data: 100 groups × 50 items (5,000 complex objects)
- Throughput: 131,042 items/second
- Performance: 38.16ms ± 0.40ms total
NumPy Arrays: Multiple arrays with ~23K total elements
- Throughput: 1,630,766 elements/second
- Performance: 14.17ms ± 3.10ms total
Pandas DataFrames: 5K total rows
- Throughput: 867,224 rows/second
- Performance: 5.88ms ± 0.43ms total
Round-Trip Performance¶
Test: Complete workflow (serialize → JSON.dumps → JSON.loads → deserialize)
# Complex data with UUIDs and timestamps
serialize_time = 2.87ms ± 1.68ms
deserialize_time = 1.82ms ± 1.52ms
total_round_trip = 4.40ms ± 1.40ms
Real-world significance: Complete API request-response cycle under 4.5ms.
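The measured cycle corresponds to a workflow like this:

import json
import uuid
from datetime import datetime

import datason

payload = {"id": uuid.uuid4(), "start_time": datetime.now()}

wire = json.dumps(datason.serialize(payload))     # serialize, then encode to JSON text
restored = datason.deserialize(json.loads(wire))  # decode, then restore rich types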
Configuration System Performance Impact¶
🚀 Updated Configuration Presets Comparison (v0.7.0)¶
Advanced Types Performance (Decimals, UUIDs, Complex numbers, Paths, Enums):
Configuration | Performance | Ops/sec | Use Case |
---|---|---|---|
ML Config | 0.99ms ± 0.09ms | 1,005 | ML pipelines, numeric focus |
API Config | 1.01ms ± 0.06ms | 988 | API responses, consistency |
Performance Config | 1.08ms ± 0.01ms | 923 | Speed-critical applications |
Default | 1.11ms ± 0.01ms | 899 | General use |
Strict Config | 13.43ms ± 2.06ms | 74 | Maximum type preservation |
Pandas DataFrame Performance (Large DataFrames with mixed types):
Configuration | Performance | Ops/sec | Best For |
---|---|---|---|
Performance Config | 1.64ms ± 0.15ms | 610 | High-throughput data processing |
ML Config | 4.75ms ± 0.20ms | 211 | ML-specific optimizations |
API Config | 4.72ms ± 0.15ms | 212 | Consistent API responses |
Default | 4.83ms ± 0.08ms | 207 | General use |
Strict Config | 5.92ms ± 2.35ms | 169 | Type safety, debugging |
Key Performance Insights¶
- Performance Config: 2.9x faster for large DataFrames (1.64ms vs 4.83ms default)
- Strict Config: Preserves maximum type information but 13.6x slower for complex types
- Configuration Overhead: Minimal for simple data; optimized presets yield significant gains on complex data
- Custom Serializers: 2.7x faster than auto-detection (2.13ms vs 5.72ms)
Date Format Performance¶
Test: 1000 datetime objects in nested structure
Format | Performance | Best For |
---|---|---|
Unix Timestamp | 3.51ms ± 0.13ms | Compact, fast parsing |
Unix Milliseconds | 3.52ms ± 0.09ms | JavaScript compatibility |
ISO Format | 3.68ms ± 0.14ms | Standards compliance |
String Format | 3.66ms ± 0.11ms | Human readability |
Custom Format | 5.19ms ± 0.09ms | Specific requirements |
NaN Handling Performance¶
Test: 3000 values with mixed NaN/None/Infinity
Strategy | Performance | Trade-off |
---|---|---|
Convert to NULL | 3.29ms ± 0.03ms | JSON compatibility |
Keep Original | 3.41ms ± 0.14ms | Exact representation |
Convert to String | 3.54ms ± 0.04ms | Preserve information |
Drop Values | 3.55ms ± 0.10ms | Clean data |
Type Coercion Impact¶
Test: 700 objects with decimals, UUIDs, complex numbers, paths, enums
Strategy | Performance | Data Fidelity |
---|---|---|
Safe (Default) | 1.79ms ± 0.08ms | Balanced approach |
Aggressive | 1.86ms ± 0.05ms | Simplified types |
Strict | 2.15ms ± 0.01ms | Maximum preservation |
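Selecting a strategy is a one-line configuration change; a sketch, assuming the `TypeCoercion` enum mirrors the strategy names above:

from decimal import Decimal

import datason
from datason import SerializationConfig, TypeCoercion  # enum name is an assumption

config = SerializationConfig(type_coercion=TypeCoercion.AGGRESSIVE)  # trade fidelity for simpler output
result = datason.serialize({"price": Decimal("9.99")}, config=config)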
DataFrame Orientation Performance¶
Small DataFrames (100 rows):
Orientation | Performance | Best For |
---|---|---|
Values | 0.06ms ± 0.00ms | Array-like data |
Dict | 0.13ms ± 0.01ms | Column-oriented |
List | 0.16ms ± 0.01ms | Simple columns |
Split | 0.18ms ± 0.02ms | Structured metadata |
Records | 0.19ms ± 0.00ms | Row-oriented |
Index | 0.20ms ± 0.01ms | Index-focused |
Large DataFrames (5000 rows):
Orientation | Performance | Memory Efficiency |
---|---|---|
Values | 0.94ms ± 0.67ms | Minimal overhead |
List | 1.26ms ± 0.03ms | Column arrays |
Split | 1.69ms ± 0.07ms | Structured format |
Dict | 1.69ms ± 0.05ms | Column mapping |
Index | 2.59ms ± 0.08ms | Index preservation |
Records | 3.90ms ± 2.68ms | Row objects |
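Orientation is selected through the configuration system; a short sketch using the options benchmarked above (the `dataframe_orient` parameter also appears in the optimization guide below):

import pandas as pd

import datason
from datason import SerializationConfig

df = pd.DataFrame({"a": range(5000), "b": [x * 1.5 for x in range(5000)]})

fast = datason.serialize(df, config=SerializationConfig(dataframe_orient="values"))       # fastest
readable = datason.serialize(df, config=SerializationConfig(dataframe_orient="records"))  # row-oriented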
Memory Usage Analysis¶
Serialized Output Sizes for complex mixed-type data:
Configuration | Serialized Size | Use Case |
---|---|---|
Performance Config | 185KB | Optimized output |
NaN Drop Config | 303KB | Clean data |
Strict Config | 388KB | Maximum information |
Analysis: Performance config produces 52% smaller output than strict config while maintaining essential information.
Why We Compare with Pickle¶
Pickle is the natural comparison point because it's the only other tool that can serialize complex Python objects like ML models and DataFrames. However, the 9.3x performance difference tells only part of the story.
🏆 When Pickle Wins¶
# Pure Python environment, speed is everything
import pickle
import time

start = time.time()
with open('model.pkl', 'wb') as f:
    pickle.dump(complex_ml_pipeline, f)  # 0.79ms
print(f"Saved in {(time.time() - start) * 1000:.1f}ms")
# ✅ Fastest option for Python-only workflows
# ✅ Perfect object reconstruction
# ✅ Handles most Python objects (classes, custom types; lambdas need dill)
🌐 When datason Wins¶
# Multi-language team, API responses, data sharing
import json
import time

import datason as ds

start = time.time()
json_data = ds.serialize(complex_ml_pipeline)  # 7.34ms
with open('model.json', 'w') as f:
    json.dump(json_data, f)
print(f"Saved in {(time.time() - start) * 1000:.1f}ms")
# ✅ Frontend team can read it immediately
# ✅ Business stakeholders can inspect results
# ✅ Works in Git diffs, text editors, web browsers
# ✅ API responses work across all platforms
# ✅ Configurable behavior for different use cases
📊 The Real Tradeoff¶
# Performance vs Versatility
pickle_speed = 0.79 # ms
datason_speed = 7.34 # ms
overhead = 6.55 # ms extra
# But with configuration optimization:
datason_performance_config = 1.66 # ms
optimized_overhead = 0.87 # ms extra
# Questions to ask:
# - Is 0.87-6.55ms overhead significant for your use case?
# - Do you need cross-language compatibility?
# - Do you need human-readable output?
# - Are you building APIs or microservices?
# For most modern applications: <7ms is negligible
# For high-frequency trading: Every microsecond matters (use pickle)
# For web APIs: Human-readable JSON is essential (use datason)
Performance Optimization Guide¶
🚀 Speed-Critical Applications¶
from datason import serialize, get_performance_config
# Use optimized configuration
config = get_performance_config()
result = serialize(data, config=config)
# → Up to 2.9x faster for large DataFrames
🎯 Balanced Performance¶
from datason import serialize, get_ml_config
# ML-optimized settings
config = get_ml_config()
result = serialize(ml_data, config=config)
# → Good performance + ML-specific optimizations
🔧 Custom Optimization¶
from datason import SerializationConfig, DateFormat, NanHandling
# Fine-tune for your use case
config = SerializationConfig(
    date_format=DateFormat.UNIX,    # Fastest date format
    nan_handling=NanHandling.NULL,  # Fastest NaN handling
    dataframe_orient="split"        # Best for large DataFrames
)
result = serialize(data, config=config)
Memory Usage Optimization¶
- Performance Config: ~131KB serialized size
- Strict Config: ~149KB serialized size (+13% memory)
- NaN Drop Config: ~135KB serialized size (clean data)
Comparative Analysis¶
vs Standard JSON¶
- Compatibility: datason handles 20+ data types vs JSON's 6 basic types
- Overhead: Only 1.5x for compatible data (vs 3-10x for many JSON alternatives)
- Safety: Graceful handling of NaN/Infinity vs JSON's errors
- Configuration: Tunable behavior vs fixed behavior
vs Pickle¶
- Speed: 9.3x slower but provides human-readable JSON
- Portability: Cross-language compatible vs Python-only
- Security: No arbitrary code execution risks
- Debugging: Human-readable output for troubleshooting
- Flexibility: Configurable serialization behavior
vs Specialized Libraries¶
- orjson/ujson: Faster for basic JSON types but cannot handle ML objects
- joblib: Good for NumPy arrays but binary format
- datason: Best balance of functionality, performance, and compatibility
Configuration Performance Recommendations¶
Use Case → Configuration Mapping¶
Your Situation | Recommended Config | Performance Gain |
---|---|---|
High-throughput data pipelines | `get_performance_config()` | Up to 2.9x faster |
ML model APIs | `get_ml_config()` | Optimized for numeric data |
REST API responses | `get_api_config()` | Consistent, readable output |
Debugging/development | `get_strict_config()` | Maximum type information |
General use | Default (no config) | Balanced approach |
DataFrame Optimization¶
- Small DataFrames (<1K rows): Use `orient="values"` (fastest)
- Large DataFrames (>1K rows): Use `orient="split"` (best scaling)
- Human-readable APIs: Use `orient="records"` (intuitive)
Date/Time Optimization¶
- Performance: Unix timestamps (`DateFormat.UNIX`)
- JavaScript compatibility: Unix milliseconds (`DateFormat.UNIX_MS`)
- Standards compliance: ISO format (`DateFormat.ISO`)
Pickle Bridge Performance¶
New in v0.3.0: datason's Pickle Bridge feature converts legacy ML pickle files to portable JSON format with enterprise-grade security.
Test Environment¶
- Feature: Pickle Bridge v0.3.0 (pickle-to-JSON conversion)
- Test Data: Basic Python objects, NumPy arrays, Pandas DataFrames
- Security: ML-safe class whitelisting (54 default safe classes)
- Method: 5 iterations per test, statistical analysis
Performance Comparison¶
Small Dataset (100 objects):
Approach | Performance | Ops/sec | Security | Use Case |
---|---|---|---|---|
Manual (pickle + datason) | 0.06ms ± 0.00ms | 16,598 | ⭐⭐ | Trusted environments |
dill + JSON | 0.05ms ± 0.00ms | 22,202 | ⭐⭐ | Extended pickle support |
jsonpickle | 0.10ms ± 0.01ms | 10,183 | ⭐⭐⭐ | General Python objects |
Pickle Bridge (datason) | 0.43ms ± 0.05ms | 2,318 | ⭐⭐⭐⭐⭐ | Production ML migration |
Large Dataset (500 objects):
Approach | Performance | Ops/sec | Relative Speed |
---|---|---|---|
Manual (pickle + datason) | 0.25ms ± 0.03ms | 4,037 | 7.5x faster |
dill + JSON | 0.15ms ± 0.00ms | 6,572 | 12.5x faster |
jsonpickle | 0.35ms ± 0.01ms | 2,860 | 5.3x faster |
Pickle Bridge (datason) | 1.87ms ± 0.08ms | 535 | 1.0x (baseline) |
Security Overhead Analysis¶
Security vs Performance Trade-off:
Mode | Performance (100 obj) | Performance (500 obj) | Security Level |
---|---|---|---|
Safe (recommended) | 0.40ms ± 0.01ms | 1.96ms ± 0.07ms | Enterprise-grade |
Unsafe (comparison) | 0.41ms ± 0.01ms | 1.84ms ± 0.05ms | No protection |
Overhead | ~2-6% slower | ~6% slower | Worth the security |
Key Finding: Enterprise security adds only 2-6% overhead - excellent trade-off for production use.
File Size Analysis¶
Pickle vs JSON Size Comparison:
Data Type | Dataset Size | Pickle Size | JSON Size | Size Ratio |
---|---|---|---|---|
Basic Objects | 100 items | 3.0 KB | 5.4 KB | 1.79x larger |
Basic Objects | 500 items | 15.0 KB | 20.0 KB | 1.34x larger |
NumPy Arrays | 100 items | 1.6 KB | 4.5 KB | 2.77x larger |
NumPy Arrays | 500 items | 6.2 KB | 14.4 KB | 2.33x larger |
Pandas DataFrames | 100 items | 2.0 KB | 5.4 KB | 2.69x larger |
Analysis:
- Basic objects scale well (1.34x overhead for larger datasets)
- NumPy/Pandas have higher overhead due to text vs binary representation
- Trade-off: 1.3-2.8x larger files for cross-platform compatibility
Bulk Operations Performance¶
Directory Conversion Benchmarks:
Operation | Performance | Ops/sec | Best For |
---|---|---|---|
Bulk conversion | 6.17ms ± 0.28ms | 162 | Multiple files at once |
Individual files | 3.89ms ± 0.12ms | 257 | Single file processing |
Recommendation: Use individual file conversion for better throughput, bulk conversion for convenience.
Real-World Performance Scenarios¶
🚀 High-Performance ML Pipeline¶
# Manual approach: Maximum speed, trusted environment
import pickle

import datason

with open('model.pkl', 'rb') as f:
    data = pickle.load(f)  # Trust your own files
result = datason.serialize(data)  # 0.06ms - 16,598 ops/sec
🛡️ Production ML Migration¶
# Pickle Bridge: Security + performance balance
from datason import PickleBridge  # import path may vary by version

bridge = PickleBridge()  # Uses the ML-safe class whitelist by default
result = bridge.from_pickle_file('model.pkl')  # 0.43ms - 2,318 ops/sec
# ✅ Prevents arbitrary code execution
# ✅ Handles 95% of ML pickle files
# ✅ Only 7.5x slower than manual approach
🌐 Cross-Platform Data Exchange¶
# jsonpickle: Good middle ground
with open('data.pkl', 'rb') as f:
data = pickle.load(f)
result = jsonpickle.encode(data) # 0.10ms - 10,183 ops/sec
# ✅ 4.3x faster than Pickle Bridge
# ⚠️ Less security validation
Performance vs Security Matrix¶
Priority | Recommended Approach | Speed | Security | Compatibility |
---|---|---|---|---|
Maximum Speed | Manual (pickle + datason) | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
Balanced | jsonpickle | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
Production Security | Pickle Bridge | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Extended Pickle | dill + JSON | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
When to Use Pickle Bridge¶
✅ Perfect For¶
- ML model deployment: Secure pickle file processing
- Data migration projects: Legacy ML pipeline modernization
- Enterprise environments: Security-first approach required
- Cross-platform APIs: Need JSON output from pickle files
- Compliance requirements: Prevent arbitrary code execution
⚠️ Consider Alternatives When¶
- Maximum speed required: Use manual approach (7.5x faster)
- Simple Python objects: jsonpickle may be sufficient
- Trusted environment only: Direct pickle + datason conversion
- Extended pickle features: dill might be better option
Integration with Existing Benchmarks¶
The Pickle Bridge complements datason's existing performance profile:
- Simple data serialization: 0.66ms (1.5x JSON overhead)
- Complex data serialization: 7.34ms (datetime/UUID objects)
- Pickle Bridge conversion: 0.43-1.87ms (varies by data size)
- Round-trip performance: 4.40ms (serialize + deserialize)
Context: Pickle Bridge adds another tool to datason's ecosystem, specifically targeting the ML migration use case with strong security guarantees.
Performance Optimization Tips¶
- Use individual file processing for better throughput (257 vs 162 ops/sec)
- Prefer bytes mode when loading pickle data from memory (see the sketch after this list)
- Monitor complex objects - some may require manual approach
- Batch similar objects for better cache utilization
- Consider manual approach for trusted, speed-critical scenarios
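For the bytes-mode tip, a hedged sketch; the `from_pickle_bytes` method is assumed from the Pickle Bridge API and worth verifying against the current docs.

from datason import PickleBridge  # import path may vary by version

bridge = PickleBridge()
with open("model.pkl", "rb") as f:
    raw = f.read()  # e.g. bytes already in memory from a network fetch or cache
result = bridge.from_pickle_bytes(raw)  # assumed method; skips a second file-system pass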
Benchmark Reproducibility¶
# Benchmark scripts are now in the external datason-benchmarks repository
# Performance testing runs automatically on PRs via GitHub Actions
# For manual benchmarking, see: https://github.com/danielendler/datason-benchmarks
Methodology¶
Benchmark Scripts¶
All measurements came from two complementary benchmark suites (now in external repository):
- `benchmark_real_performance.py`: Core performance baselines
- `enhanced_benchmark_suite.py`: Configuration system impact analysis

Both scripts:
- Multiple Iterations: Run each test 5 times for statistical reliability
- Warm-up: First measurement discarded (JIT compilation, cache loading)
- Statistical Analysis: Report mean, standard deviation, operations per second
- Real Data: Use realistic data structures, not toy examples
- Fair Comparison: Compare like-for-like where possible
Test Data Characteristics¶
- Simple data: JSON-compatible objects only
- Complex data: UUIDs, datetimes, nested structures
- Large data: Thousands of objects with realistic size
- ML data: NumPy arrays, Pandas DataFrames of representative sizes
- Advanced types: Decimals, complex numbers, paths, enums
Measurement Precision¶
- Uses `time.perf_counter()` for high-resolution timing
- Measures end-to-end, including all overhead
- No artificial optimizations or cherry-picked scenarios
When datason Excels¶
- Mixed data types: Standard + ML objects in one structure
- API responses: Need JSON compatibility with complex data
- Data science workflows: Frequent DataFrame/NumPy serialization
- Cross-platform: Human-readable output required
- Configurable behavior: Different performance requirements per use case
- ML migration projects: Secure pickle-to-JSON conversion (NEW in v0.3.0)
- Enterprise security: Prevent arbitrary code execution from pickle files
Performance Tips¶
- Choose the right configuration: 2.9x performance difference between configs
- Use custom serializers: 2.7x faster for known object types
- Optimize date formats: Unix timestamps are fastest
- Batch operations: Group objects for better throughput
- Profile your use case: Run benchmarks with your actual data
When to Consider Alternatives¶
- Pure speed + basic types: Use orjson/ujson
- Python-only + complex objects: Use pickle (~9x faster for complex data)
- Scientific arrays + compression: Use joblib
- Maximum compatibility: Use standard json with manual type handling
Benchmark History¶
Date | Version | Change | Performance Impact |
---|---|---|---|
2025-06-06 | 0.5.0 | Performance breakthrough & security fixes | 1.6M+ elements/sec, critical security hardening |
2025-06-02 | 0.4.5 | Template deserialization & chunked processing | 24x deserialization speedup |
2025-06-01 | 0.2.0 | Configuration system added | Up to 2.9x speedup possible with optimization |
2025-05-30 | 0.3.0 | Pickle Bridge feature added | New: ML pickle-to-JSON conversion (2,318 ops/sec) |
Cache Scope Benchmarks (NEW in v0.7.0)¶
Demonstrates the performance impact of the configurable caching system. Results for 1000 repeated UUID strings on a typical laptop:
Cache Scope | Time (ms) |
---|---|
DISABLED | ~2.8 |
OPERATION | ~2.3 |
REQUEST | ~1.9 |
PROCESS | ~1.3 |
Run with:
# Cache scope benchmarks are now in the external datason-benchmarks repository
# See: https://github.com/danielendler/datason-benchmarks
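For context, selecting a scope looks roughly like this; the `set_cache_scope`/`CacheScope` names follow the v0.7.0 caching feature but are assumptions here:

import uuid

import datason
from datason import CacheScope, set_cache_scope  # assumed names

set_cache_scope(CacheScope.PROCESS)  # widest reuse: the fastest row in the table above
payload = datason.serialize([uuid.uuid4() for _ in range(3)] * 100)  # repeated UUID strings
restored = datason.deserialize(payload)  # repeated parses hit the cache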
Running Benchmarks¶
# Performance benchmarking is now handled by external datason-benchmarks repository
# Runs automatically on every PR via .github/workflows/pr-performance-check.yml
# For manual benchmarking:
# 1. Visit: https://github.com/danielendler/datason-benchmarks
# 2. Clone the repository
# 3. Follow the setup instructions
# Run local performance tests (minimal)
python -m pytest tests/performance/ -v
Interpreting Results¶
Statistical Significance¶
- Mean: Primary performance metric
- Standard deviation: Consistency indicator (lower = more consistent)
- Operations per second: Throughput measurement
Real-World Context¶
- Sub-millisecond: Excellent for interactive applications
- Single-digit milliseconds: Good for API responses
- Double-digit milliseconds: Acceptable for batch processing
- 100ms+: May need optimization for real-time use
Configuration Impact¶
- Performance Config: Choose when speed is critical
- Strict Config: Use for debugging, accept slower performance
- Default: Good balance for most applications
Last updated: June 6, 2025. Benchmarks reflect datason v0.7.0 with performance optimizations and security hardening.