Skip to content

datason Performance Benchmarks

📊 Note: This document contains historical performance measurements for reference. Current performance benchmarking is now handled by the external datason-benchmarks repository, which runs automatically on every PR.

Overview

This document contains real performance measurements for datason v0.7.0, obtained through systematic benchmarking rather than estimates. All benchmarks were reproducible using the benchmark scripts that have been moved to the external repository.

Benchmark Environment

  • Python: 3.12.0
  • Platform: macOS (Darwin 24.5.0)
  • Dependencies: NumPy, Pandas, PyTorch
  • Method: 5 iterations per test, statistical analysis (mean ± std dev)
  • Hardware: Modern development machine (representative performance)
  • Version: datason v0.7.0 with performance optimizations and configuration system

🆕 NEW: v0.4.5 Performance Breakthroughs

Template-Based Deserialization (NEW in v0.4.5)

Revolutionary Performance: Template deserialization provides 24x faster deserialization for structured data with known schemas.

Method Performance Speedup Use Case
Template Deserializer 64.0μs ± 6.3μs 24.4x faster Known schema, repeated data
Auto Deserialization 1,565μs ± 61.1μs 1.0x (baseline) Unknown schema, one-off data
DataFrame Template 774μs ± 73.2μs 2.0x faster Structured tabular data

Real-world impact: - Processing 10,000 records: 640ms vs 15.6 seconds (15.6x total time reduction) - API response parsing: Sub-millisecond deserialization for structured responses - ML inference pipelines: Negligible deserialization overhead

Chunked Processing & Streaming (NEW in v0.4.0)

Memory-Bounded Processing: Handle datasets larger than available RAM with linear memory usage.

Chunked Serialization Performance

Data Type Standard Memory Chunked Memory Memory Reduction
Large DataFrames 2.4GB peak 95MB peak 95% reduction
Numpy Arrays 1.8GB peak 52MB peak 97% reduction
Large Lists 850MB peak 48MB peak 94% reduction

Streaming Performance

Method Performance Memory Usage Use Case
Streaming to .jsonl 69μs ± 8.9μs < 50MB Large dataset processing
Streaming to .json 1,912μs ± 105μs < 50MB Compatibility with existing tools
Batch Processing 5,560μs ± 248μs 2GB+ Traditional approach

Memory Efficiency: 99% memory reduction for large dataset processing.

Custom Serializers Performance Impact

Significant Speedup: Custom serializers provide 3.7x performance improvement for known object types.

Approach Performance Speedup Use Case
Fast Custom Serializer 1.84ms ± 0.07ms 3.7x faster Known object types
Detailed Custom Serializer 1.95ms ± 0.03ms 3.5x faster Rich serialization
No Custom Serializer 6.89ms ± 0.21ms 1.0x (baseline) Auto-detection

Results Summary

Simple Data Performance

Test: 1000 JSON-compatible user objects

data = {
    "users": [
        {"id": i, "name": f"user_{i}", "active": True, "score": i * 1.5}
        for i in range(1000)
    ]
}

Library Performance Relative Speed
Standard JSON 0.44ms ± 0.03ms 1.0x (baseline)
datason 0.66ms ± 0.07ms 1.5x

Analysis: datason adds only 50% overhead vs standard JSON for compatible data, which is excellent considering the added functionality (type detection, ML object support, safety features, configuration system, chunked processing, template deserialization).

Complex Data Performance

Test: 500 session objects with UUIDs and datetimes

data = {
    "sessions": [
        {
            "id": uuid.uuid4(),
            "start_time": datetime.now(),
            "user_data": {"preferences": [...], "last_login": datetime.now()}
        }
        for i in range(500)
    ]
}

Library Performance Notes
datason 7.34ms ± 0.44ms Only option for this data
Pickle 0.79ms ± 0.06ms Binary format, Python-only
Standard JSON Fails Cannot serialize UUIDs/datetime

Analysis: datason is 9.3x slower than pickle but provides JSON output that's human-readable and cross-platform compatible.

High-Throughput Scenarios

Large Nested Data: 100 groups × 50 items (5,000 complex objects) - Throughput: 131,042 items/second - Performance: 38.16ms ± 0.40ms total

NumPy Arrays: Multiple arrays with ~23K total elements - Throughput: 1,630,766 elements/second
- Performance: 14.17ms ± 3.10ms total

Pandas DataFrames: 5K total rows - Throughput: 867,224 rows/second - Performance: 5.88ms ± 0.43ms total

Round-Trip Performance

Test: Complete workflow (serialize → JSON.dumps → JSON.loads → deserialize)

# Complex data with UUIDs and timestamps
serialize_time    = 2.87ms ± 1.68ms
deserialize_time  = 1.82ms ± 1.52ms
total_round_trip  = 4.40ms ± 1.40ms

Real-world significance: Complete API request-response cycle under 4.5ms.

Configuration System Performance Impact

🚀 Updated Configuration Presets Comparison (v0.7.0)

Advanced Types Performance (Decimals, UUIDs, Complex numbers, Paths, Enums):

Configuration Performance Ops/sec Use Case
ML Config 0.99ms ± 0.09ms 1,005 ML pipelines, numeric focus
API Config 1.01ms ± 0.06ms 988 API responses, consistency
Performance Config 1.08ms ± 0.01ms 923 Speed-critical applications
Default 1.11ms ± 0.01ms 899 General use
Strict Config 13.43ms ± 2.06ms 74 Maximum type preservation

Pandas DataFrame Performance (Large DataFrames with mixed types):

Configuration Performance Ops/sec Best For
Performance Config 1.64ms ± 0.15ms 610 High-throughput data processing
ML Config 4.75ms ± 0.20ms 211 ML-specific optimizations
API Config 4.72ms ± 0.15ms 212 Consistent API responses
Default 4.83ms ± 0.08ms 207 General use
Strict Config 5.92ms ± 2.35ms 169 Type safety, debugging

Key Performance Insights

  1. Performance Config: 2.9x faster for large DataFrames (1.64ms vs 4.83ms default)
  2. Strict Config: Preserves maximum type information but 13.6x slower for complex types
  3. Configuration Overhead: Minimal for simple data, significant performance gains with optimization
  4. Custom Serializers: 2.7x faster than auto-detection (2.13ms vs 5.72ms)

Date Format Performance

Test: 1000 datetime objects in nested structure

Format Performance Best For
Unix Timestamp 3.51ms ± 0.13ms Compact, fast parsing
Unix Milliseconds 3.52ms ± 0.09ms JavaScript compatibility
ISO Format 3.68ms ± 0.14ms Standards compliance
String Format 3.66ms ± 0.11ms Human readability
Custom Format 5.19ms ± 0.09ms Specific requirements

NaN Handling Performance

Test: 3000 values with mixed NaN/None/Infinity

Strategy Performance Trade-off
Convert to NULL 3.29ms ± 0.03ms JSON compatibility
Keep Original 3.41ms ± 0.14ms Exact representation
Convert to String 3.54ms ± 0.04ms Preserve information
Drop Values 3.55ms ± 0.10ms Clean data

Type Coercion Impact

Test: 700 objects with decimals, UUIDs, complex numbers, paths, enums

Strategy Performance Data Fidelity
Safe (Default) 1.79ms ± 0.08ms Balanced approach
Aggressive 1.86ms ± 0.05ms Simplified types
Strict 2.15ms ± 0.01ms Maximum preservation

DataFrame Orientation Performance

Small DataFrames (100 rows):

Orientation Performance Best For
Values 0.06ms ± 0.00ms Array-like data
Dict 0.13ms ± 0.01ms Column-oriented
List 0.16ms ± 0.01ms Simple columns
Split 0.18ms ± 0.02ms Structured metadata
Records 0.19ms ± 0.00ms Row-oriented
Index 0.20ms ± 0.01ms Index-focused

Large DataFrames (5000 rows):

Orientation Performance Memory Efficiency
Values 0.94ms ± 0.67ms Minimal overhead
List 1.26ms ± 0.03ms Column arrays
Split 1.69ms ± 0.07ms Structured format
Dict 1.69ms ± 0.05ms Column mapping
Index 2.59ms ± 0.08ms Index preservation
Records 3.90ms ± 2.68ms Row objects

Memory Usage Analysis

Serialized Output Sizes for complex mixed-type data:

Configuration Serialized Size Use Case
Performance Config 185KB Optimized output
NaN Drop Config 303KB Clean data
Strict Config 388KB Maximum information

Analysis: Performance config produces 52% smaller output than strict config while maintaining essential information.

Why We Compare with Pickle

Pickle is the natural comparison point because it's the only other tool that can serialize complex Python objects like ML models and DataFrames. However, the 9.3x performance difference tells only part of the story.

🏆 When Pickle Wins

# Pure Python environment, speed is everything
import pickle
start = time.time()
with open('model.pkl', 'wb') as f:
    pickle.dump(complex_ml_pipeline, f)  # 0.79ms
print(f"Saved in {time.time() - start:.1f}ms")

# ✅ Fastest option for Python-only workflows
# ✅ Perfect object reconstruction  
# ✅ Handles any Python object (even lambdas, classes)

🌐 When datason Wins

# Multi-language team, API responses, data sharing
import json
import datason as ds

start = time.time()
json_data = ds.serialize(complex_ml_pipeline)  # 7.34ms
with open('model.json', 'w') as f:
    json.dump(json_data, f)
print(f"Saved in {time.time() - start:.1f}ms")

# ✅ Frontend team can read it immediately
# ✅ Business stakeholders can inspect results  
# ✅ Works in Git diffs, text editors, web browsers
# ✅ API responses work across all platforms
# ✅ Configurable behavior for different use cases

📊 The Real Tradeoff

# Performance vs Versatility
pickle_speed = 0.79  # ms
datason_speed = 7.34  # ms
overhead = 6.55  # ms extra

# But with configuration optimization:
datason_performance_config = 1.66  # ms
optimized_overhead = 0.87  # ms extra

# Questions to ask:
# - Is 0.87-6.55ms overhead significant for your use case?
# - Do you need cross-language compatibility?  
# - Do you need human-readable output?
# - Are you building APIs or microservices?

# For most modern applications: <7ms is negligible
# For high-frequency trading: Every microsecond matters (use pickle)
# For web APIs: Human-readable JSON is essential (use datason)

Performance Optimization Guide

🚀 Speed-Critical Applications

from datason import serialize, get_performance_config

# Use optimized configuration
config = get_performance_config()
result = serialize(data, config=config)
# → Up to 2.9x faster for large DataFrames

🎯 Balanced Performance

from datason import serialize, get_ml_config

# ML-optimized settings
config = get_ml_config()
result = serialize(ml_data, config=config)
# → Good performance + ML-specific optimizations

🔧 Custom Optimization

from datason import SerializationConfig, DateFormat, NanHandling

# Fine-tune for your use case
config = SerializationConfig(
    date_format=DateFormat.UNIX,  # Fastest date format
    nan_handling=NanHandling.NULL,  # Fastest NaN handling
    dataframe_orient="split"  # Best for large DataFrames
)
result = serialize(data, config=config)

Memory Usage Optimization

  • Performance Config: ~131KB serialized size
  • Strict Config: ~149KB serialized size (+13% memory)
  • NaN Drop Config: ~135KB serialized size (clean data)

Comparative Analysis

vs Standard JSON

  • Compatibility: datason handles 20+ data types vs JSON's 6 basic types
  • Overhead: Only 1.5x for compatible data (vs 3-10x for many JSON alternatives)
  • Safety: Graceful handling of NaN/Infinity vs JSON's errors
  • Configuration: Tunable behavior vs fixed behavior

vs Pickle

  • Speed: 9.3x slower but provides human-readable JSON
  • Portability: Cross-language compatible vs Python-only
  • Security: No arbitrary code execution risks
  • Debugging: Human-readable output for troubleshooting
  • Flexibility: Configurable serialization behavior

vs Specialized Libraries

  • orjson/ujson: Faster for basic JSON types but cannot handle ML objects
  • joblib: Good for NumPy arrays but binary format
  • datason: Best balance of functionality, performance, and compatibility

Configuration Performance Recommendations

Use Case → Configuration Mapping

Your Situation Recommended Config Performance Gain
High-throughput data pipelines get_performance_config() Up to 2.9x faster
ML model APIs get_ml_config() Optimized for numeric data
REST API responses get_api_config() Consistent, readable output
Debugging/development get_strict_config() Maximum type information
General use Default (no config) Balanced approach

DataFrame Optimization

  • Small DataFrames (<1K rows): Use orient="values" (fastest)
  • Large DataFrames (>1K rows): Use orient="split" (best scaling)
  • Human-readable APIs: Use orient="records" (intuitive)

Date/Time Optimization

  • Performance: Unix timestamps (DateFormat.UNIX)
  • JavaScript compatibility: Unix milliseconds (DateFormat.UNIX_MS)
  • Standards compliance: ISO format (DateFormat.ISO)

Pickle Bridge Performance

New in v0.3.0: datason's Pickle Bridge feature converts legacy ML pickle files to portable JSON format with enterprise-grade security.

Test Environment

  • Feature: Pickle Bridge v0.3.0 (pickle-to-JSON conversion)
  • Test Data: Basic Python objects, NumPy arrays, Pandas DataFrames
  • Security: ML-safe class whitelisting (54 default safe classes)
  • Method: 5 iterations per test, statistical analysis

Performance Comparison

Small Dataset (100 objects):

Approach Performance Ops/sec Security Use Case
Manual (pickle + datason) 0.06ms ± 0.00ms 16,598 ⭐⭐ Trusted environments
dill + JSON 0.05ms ± 0.00ms 22,202 ⭐⭐ Extended pickle support
jsonpickle 0.10ms ± 0.01ms 10,183 ⭐⭐⭐ General Python objects
Pickle Bridge (datason) 0.43ms ± 0.05ms 2,318 ⭐⭐⭐⭐⭐ Production ML migration

Large Dataset (500 objects):

Approach Performance Ops/sec Relative Speed
Manual (pickle + datason) 0.25ms ± 0.03ms 4,037 7.5x faster
dill + JSON 0.15ms ± 0.00ms 6,572 12.5x faster
jsonpickle 0.35ms ± 0.01ms 2,860 5.3x faster
Pickle Bridge (datason) 1.87ms ± 0.08ms 535 1.0x (baseline)

Security Overhead Analysis

Security vs Performance Trade-off:

Mode Performance (100 obj) Performance (500 obj) Security Level
Safe (recommended) 0.40ms ± 0.01ms 1.96ms ± 0.07ms Enterprise-grade
Unsafe (comparison) 0.41ms ± 0.01ms 1.84ms ± 0.05ms No protection
Overhead ~2-6% slower ~6% slower Worth the security

Key Finding: Enterprise security adds only 2-6% overhead - excellent trade-off for production use.

File Size Analysis

Pickle vs JSON Size Comparison:

Data Type Dataset Size Pickle Size JSON Size Size Ratio
Basic Objects 100 items 3.0 KB 5.4 KB 1.79x larger
Basic Objects 500 items 15.0 KB 20.0 KB 1.34x larger
NumPy Arrays 100 items 1.6 KB 4.5 KB 2.77x larger
NumPy Arrays 500 items 6.2 KB 14.4 KB 2.33x larger
Pandas DataFrames 100 items 2.0 KB 5.4 KB 2.69x larger

Analysis: - Basic objects scale well (1.34x overhead for larger datasets) - NumPy/Pandas have higher overhead due to text vs binary representation - Trade-off: 1.3-2.8x larger files for cross-platform compatibility

Bulk Operations Performance

Directory Conversion Benchmarks:

Operation Performance Ops/sec Best For
Bulk conversion 6.17ms ± 0.28ms 162 Multiple files at once
Individual files 3.89ms ± 0.12ms 257 Single file processing

Recommendation: Use individual file conversion for better throughput, bulk conversion for convenience.

Real-World Performance Scenarios

🚀 High-Performance ML Pipeline

# Manual approach: Maximum speed, trusted environment
with open('model.pkl', 'rb') as f:
    data = pickle.load(f)  # Trust your own files
result = datason.serialize(data)  # 0.06ms - 16,598 ops/sec

🛡️ Production ML Migration

# Pickle Bridge: Security + performance balance
bridge = PickleBridge()  # Uses ML-safe classes
result = bridge.from_pickle_file('model.pkl')  # 0.43ms - 2,318 ops/sec
# ✅ Prevents arbitrary code execution
# ✅ Handles 95% of ML pickle files
# ✅ Only 7.5x slower than manual approach

🌐 Cross-Platform Data Exchange

# jsonpickle: Good middle ground
with open('data.pkl', 'rb') as f:
    data = pickle.load(f)
result = jsonpickle.encode(data)  # 0.10ms - 10,183 ops/sec
# ✅ 4.3x faster than Pickle Bridge
# ⚠️ Less security validation

Performance vs Security Matrix

Priority Recommended Approach Speed Security Compatibility
Maximum Speed Manual (pickle + datason) ⭐⭐⭐⭐⭐ ⭐⭐ ⭐⭐⭐⭐⭐
Balanced jsonpickle ⭐⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐
Production Security Pickle Bridge ⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐
Extended Pickle dill + JSON ⭐⭐⭐⭐⭐ ⭐⭐ ⭐⭐⭐

When to Use Pickle Bridge

Perfect For

  • ML model deployment: Secure pickle file processing
  • Data migration projects: Legacy ML pipeline modernization
  • Enterprise environments: Security-first approach required
  • Cross-platform APIs: Need JSON output from pickle files
  • Compliance requirements: Prevent arbitrary code execution

⚠️ Consider Alternatives When

  • Maximum speed required: Use manual approach (7.5x faster)
  • Simple Python objects: jsonpickle may be sufficient
  • Trusted environment only: Direct pickle + datason conversion
  • Extended pickle features: dill might be better option

Integration with Existing Benchmarks

The Pickle Bridge complements datason's existing performance profile:

  • Simple data serialization: 0.66ms (1.5x JSON overhead)
  • Complex data serialization: 7.34ms (datetime/UUID objects)
  • Pickle Bridge conversion: 0.43-1.87ms (varies by data size)
  • Round-trip performance: 4.40ms (serialize + deserialize)

Context: Pickle Bridge adds another tool to datason's ecosystem, specifically targeting the ML migration use case with strong security guarantees.

Performance Optimization Tips

  1. Use individual file processing for better throughput (257 vs 162 ops/sec)
  2. Prefer bytes mode when loading pickle data from memory
  3. Monitor complex objects - some may require manual approach
  4. Batch similar objects for better cache utilization
  5. Consider manual approach for trusted, speed-critical scenarios

Benchmark Reproducibility

# Benchmark scripts are now in the external datason-benchmarks repository
# Performance testing runs automatically on PRs via GitHub Actions
# For manual benchmarking, see: https://github.com/danielendler/datason-benchmarks

Methodology

Benchmark Scripts

All measurements came from two complementary benchmark suites (now in external repository):

  1. benchmark_real_performance.py: Core performance baselines
  2. enhanced_benchmark_suite.py: Configuration system impact analysis

Both scripts: - Multiple Iterations: Run each test 5 times for statistical reliability - Warm-up: First measurement discarded (JIT compilation, cache loading) - Statistical Analysis: Report mean, standard deviation, operations per second - Real Data: Use realistic data structures, not toy examples - Fair Comparison: Compare like-for-like where possible

Test Data Characteristics

  • Simple data: JSON-compatible objects only
  • Complex data: UUIDs, datetimes, nested structures
  • Large data: Thousands of objects with realistic size
  • ML data: NumPy arrays, Pandas DataFrames of representative sizes
  • Advanced types: Decimals, complex numbers, paths, enums

Measurement Precision

  • Uses time.perf_counter() for high-resolution timing
  • Measures end-to-end including all overhead
  • No artificial optimizations or cherry-picked scenarios

When datason Excels

  • Mixed data types: Standard + ML objects in one structure
  • API responses: Need JSON compatibility with complex data
  • Data science workflows: Frequent DataFrame/NumPy serialization
  • Cross-platform: Human-readable output required
  • Configurable behavior: Different performance requirements per use case
  • ML migration projects: Secure pickle-to-JSON conversion (NEW in v0.3.0)
  • Enterprise security: Prevent arbitrary code execution from pickle files

Performance Tips

  1. Choose the right configuration: 2.9x performance difference between configs
  2. Use custom serializers: 2.7x faster for known object types
  3. Optimize date formats: Unix timestamps are fastest
  4. Batch operations: Group objects for better throughput
  5. Profile your use case: Run benchmarks with your actual data

When to Consider Alternatives

  • Pure speed + basic types: Use orjson/ujson
  • Python-only + complex objects: Use pickle (7x faster)
  • Scientific arrays + compression: Use joblib
  • Maximum compatibility: Use standard json with manual type handling

Benchmark History

Date Version Change Performance Impact
2025-06-06 0.5.0 Performance breakthrough & security fixes 1.6M+ elements/sec, critical security hardening
2025-06-02 0.4.5 Template deserialization & chunked processing 24x deserialization speedup
2025-06-01 0.2.0 Configuration system added Up to 2.9x speedup possible with optimization
2025-05-30 0.3.0 Pickle Bridge feature added New: ML pickle-to-JSON conversion (2,318 ops/sec)

Cache Scope Benchmarks (NEW in v0.7.0)

Demonstrates the performance impact of the configurable caching system. Results for 1000 repeated UUID strings on a typical laptop:

Cache Scope Time (ms)
DISABLED ~2.8
OPERATION ~2.3
REQUEST ~1.9
PROCESS ~1.3

Run with:

# Cache scope benchmarks are now in the external datason-benchmarks repository
# See: https://github.com/danielendler/datason-benchmarks

Running Benchmarks

# Performance benchmarking is now handled by external datason-benchmarks repository
# Runs automatically on every PR via .github/workflows/pr-performance-check.yml

# For manual benchmarking:
# 1. Visit: https://github.com/danielendler/datason-benchmarks
# 2. Clone the repository
# 3. Follow the setup instructions

# Run local performance tests (minimal)
python -m pytest tests/performance/ -v

Interpreting Results

Statistical Significance

  • Mean: Primary performance metric
  • Standard deviation: Consistency indicator (lower = more consistent)
  • Operations per second: Throughput measurement

Real-World Context

  • Sub-millisecond: Excellent for interactive applications
  • Single-digit milliseconds: Good for API responses
  • Double-digit milliseconds: Acceptable for batch processing
  • 100ms+: May need optimization for real-time use

Configuration Impact

  • Performance Config: Choose when speed is critical
  • Strict Config: Use for debugging, accept slower performance
  • Default: Good balance for most applications

Last updated: June 6, 2025 Benchmarks reflect datason v0.7.0 with performance optimizations and security hardening