Skip to content

datason Product Roadmap (Updated with Integration Feedback)

Mission: Make ML/data workflows reliably portable, readable, and structurally type-safe using human-friendly JSON.


🎯 Core Principles (Non-Negotiable)

Minimal Dependencies

  • Zero required dependencies for core functionality
  • Optional dependencies only for specific integrations (pandas, torch, etc.)
  • Never add dependencies that duplicate Python stdlib functionality

Performance First

  • Maintain <3x stdlib JSON overhead for simple types
  • Benchmark-driven development with regression prevention
  • Memory efficiency through configurable limits and smart defaults

Comprehensive Test Coverage

  • Maintain >90% test coverage across all features
  • Test all edge cases and failure modes
  • Performance regression testing for every release

�� Current State (v0.7.0) - SIGNIFICANTLY IMPROVED

Foundation Complete & Enhanced

  • Core Serialization: 20+ data types, circular reference detection, security limits
  • Configuration System: 4 preset configs + 13+ configurable options
  • Advanced Type Handling: Complex numbers, decimals, UUIDs, paths, enums, collections
  • ML/AI Integration: PyTorch, TensorFlow, scikit-learn, NumPy, JAX, PIL
  • Pandas Deep Integration: 6 DataFrame orientations, Series, Categorical, NaN handling
  • Performance Optimizations: Early detection, memory streaming, configurable limits
  • Comprehensive Testing: 79% coverage, 1060+ tests, benchmark suite
  • 🚀 MAJOR IMPROVEMENT: Enhanced deserialization engine with cache management and UUID/datetime detection
  • 🔒 Security Hardening: Complete protection against depth bombs, size bombs, and other attack vectors

📊 Performance Baseline (Enhanced)

  • Simple JSON: 1.6x overhead vs stdlib (excellent for added functionality)
  • Complex types: Only option for UUIDs/datetime/ML objects in pure JSON
  • Advanced configs: 15-40% performance improvement over default
  • 🆕 Deserialization Performance: 2.1M+ elements/second for NumPy arrays

RESOLVED Critical Issues from Real-World Usage

  • ✅ FIXED: All originally failing deserialization tests (9/9 resolved)
  • ✅ FIXED: UUID test order dependency and cache corruption issues
  • ✅ FIXED: Decimal handling in template deserializer with proper error handling
  • ✅ FIXED: Critical bug in _process_dict_optimized for non-numeric string conversion
  • ✅ IMPROVED: Enhanced safe_deserialize behavior consistency
  • ✅ ADDED: Comprehensive cache clearing functionality for testing

🔍 Validated by Comprehensive Testing

Results from Enhanced Test Suite: - ✅ 1060 tests passing, 0 failed (was 9 failing) - ✅ 79% overall test coverage (significant improvement) - ✅ 100% type preservation accuracy for serialization - ✅ Robust deserialization for core types - ⚠️ Partial deserialization gaps for complex ML types (identified by audit)

📊 Deserialization Audit Results (v0.7.0)

Current Round-Trip Status (68 total tests): - ✅ Basic Types: 95.0% success (19/20) - only set → list expected - ✅ Complex Types: 86.7% success (13/15) - UUID and nested structure gaps
- ✅ NumPy Types: 71.4% success (10/14) - scalars work, arrays need metadata - ⚠️ Pandas Types: 30.8% success (4/13) - DataFrames/Series need enhanced metadata - ❌ ML Types: 0.0% success (0/6) - PyTorch/sklearn need significant work


🚀 Updated Focused Roadmap

Philosophy: Perfect bidirectional ML serialization before expanding scope

v0.7.0 - COMPLETED: Critical Deserialization Fixes (COMPLETED)

"Fix core deserialization functionality blocking production adoption"

Achieved Goals

  • Fixed all critical deserialization bugs - 9/9 originally failing tests now pass
  • Enhanced UUID/datetime detection - robust ordering and pattern matching
  • Cache management system - prevents test order dependencies
  • Decimal handling improvements - proper error handling and type coercion
  • Security hardening - complete protection against attack vectors
  • Performance improvements - 2.1M+ elements/second processing

Success Metrics ACHIEVED

  • ✅ 100% of originally failing tests now pass
  • ✅ Comprehensive security test suite (28/28 tests passing)
  • ✅ 79% overall test coverage with robust deserialization engine
  • ✅ Zero breaking changes to existing API

v0.7.5 - Enhanced ML Type Metadata & Round-Trip Completion (COMPLETED AHEAD OF SCHEDULE)

"Complete the missing 32.4% round-trip gaps identified by audit" - EXCEEDED GOALS

🎯 ACHIEVED: Complete Template Deserializer Enhancement

EXCEEDED expectations with comprehensive template-based round-trip support.

# ✅ FIXED: All priority round-trip cases now work with templates
# Priority 1: Complex types with templates - 100% SUCCESS
uuid_template = uuid.UUID("12345678-1234-5678-9012-123456789abc")
reconstructed = deserialize_with_template(serialized_data, uuid_template)
assert isinstance(reconstructed, uuid.UUID)  # ✅ WORKS PERFECTLY

# Priority 2: ML types with templates - 100% SUCCESS  
torch_tensor = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
serialized = datason.serialize(torch_tensor)
reconstructed = deserialize_with_template(serialized, torch_tensor)
assert isinstance(reconstructed, torch.Tensor)  # ✅ WORKS PERFECTLY
assert torch.equal(reconstructed, torch_tensor)  # ✅ EXACT MATCH

# Priority 3: DataFrame/NumPy reconstruction - 100% SUCCESS
np_array = np.array([1, 2, 3], dtype=np.int32)
serialized = datason.serialize(np_array)
reconstructed = deserialize_with_template(serialized, np_array)
assert reconstructed.dtype == np.int32  # ✅ EXACT DTYPE PRESERVED

COMPLETED Implementation Goals - ALL ACHIEVED

  • Enhanced ML template deserializer - PyTorch tensors, sklearn models, NumPy arrays
  • Perfect type reconstruction - 100% success rate with templates
  • Comprehensive verification system - proper equality testing for all types (34 tests)
  • 4-Mode detection strategy - systematic testing across all detection modes
  • Zero new dependencies - extended existing type handler system

🏆 EXCEEDED Success Metrics - 100% ACHIEVEMENT

  • 100% template round-trip success for all supported ML types (exceeded 90% target)
  • 17+ types with perfect reconstruction via template system
  • 34 comprehensive tests covering all detection modes
  • Deterministic behavior - predictable type conversion across all modes
  • Zero performance regression - maintained existing functionality

v0.8.0 - Complete Round-Trip Support & Production ML Workflow (4-6 weeks)

"Perfect bidirectional type preservation - the foundation of ML portability"

🎯 Unique Value Proposition

Achieve the original v0.3.0 goals with comprehensive round-trip support.

# Complete bidirectional support with type metadata
data = {
    "model": sklearn_model,
    "features": np.array([[1.0, 2.0, 3.0]], dtype=np.float32),
    "timestamp": datetime.now(),
    "config": {"learning_rate": Decimal("0.001")}
}

# Serialize with type hints for perfect reconstruction
serialized = datason.serialize(data, include_type_hints=True)
# → {"model": {...}, "__datason_types__": {"features": "numpy.float32[1,3]"}}

# Perfect reconstruction - this MUST work reliably
reconstructed = datason.deserialize_with_types(serialized)
assert type(reconstructed["features"]) == np.ndarray
assert reconstructed["features"].dtype == np.float32
assert reconstructed["features"].shape == (1, 3)

# Legacy pickle bridge with perfect round-trips
json_data = datason.from_pickle("model.pkl", include_type_hints=True)
original_model = datason.deserialize_with_types(json_data)

🔧 Implementation Goals

  • CRITICAL: Achieve 99%+ round-trip success for all supported types
  • Enhanced type metadata system - comprehensive ML object reconstruction
  • Pickle bridge with round-trips - convert legacy files with full fidelity
  • Complete verification system - proper equality testing for all types
  • Zero new dependencies - extend existing type handler system

📈 Success Metrics

  • 99%+ round-trip fidelity for all supported types (dtype, shape, values)
  • 100% of ML objects can be serialized AND reconstructed perfectly
  • Support 95%+ of sklearn/torch/pandas pickle files with full round-trips
  • Zero new dependencies added
  • Complete type reconstruction test suite

v0.8.5 - Smart Deserialization & Enhanced ML Types (6-8 weeks)

"Intelligent type reconstruction + domain-specific type handlers"

🎯 Unique Value Proposition

Combine auto-detection with custom domain types (financial ML team request).

# Smart auto-detection deserialization
reconstructed = datason.safe_deserialize(json_data)  
# Uses heuristics: "2023-01-01T00:00:00" → datetime, [1,2,3] → list/array

# Custom domain type handlers (financial ML team validated need)
@datason.register_type_handler
class MonetaryAmount:
    def serialize(self, value):
        return {"amount": str(value.amount), "currency": value.currency}

    def deserialize(self, data):
        return MonetaryAmount(data["amount"], data["currency"])

# Extended ML framework support
data = {
    "xarray_dataset": xr.Dataset({"temp": (["x", "y"], np.random.random((3, 4)))}),
    "dask_dataframe": dd.from_pandas(large_df, npartitions=4),
    "financial_instrument": MonetaryAmount("100.50", "USD")
}
result = datason.serialize(data, config=get_ml_config())
reconstructed = datason.deserialize_with_types(result)  # Perfect round-trip

🔧 Implementation Goals

  • Custom type handler registration - extensible type system for domain types
  • Auto-detection deserialization - safe_deserialize with smart guessing
  • Extended ML support - xarray, dask, huggingface, scientific libs
  • Migration utilities - help teams convert from other formats

📈 Success Metrics

  • 85%+ accuracy in auto-type detection for common patterns
  • Support 10+ additional ML/scientific libraries
  • Custom type handler system for domain-specific types
  • Migration utilities for common serialization libraries

v0.9.0 - Performance Optimization & Monitoring (8-10 weeks)

"Make datason the fastest option with full visibility into performance"

🎯 Unique Value Proposition

ACCELERATED FROM v0.4.0: Financial ML team validated need for performance monitoring.

# Performance monitoring (financial team high-priority request)
with datason.profile() as prof:
    result = datason.serialize(large_financial_dataset)

print(prof.report())
# Output:
# Serialization Time: 1.2s, Memory Peak: 45MB
# Type Conversions: 1,247 (UUID: 89, DateTime: 445, DataFrame: 1)  
# Bottlenecks: DataFrame orientation conversion (0.8s)

# Memory-efficient streaming for large objects  
with datason.stream_serialize("large_experiment.json") as stream:
    stream.write({"model": huge_model})
    stream.write({"data": massive_dataset})

# Enhanced chunked processing (conditional based on user demand)
chunks = datason.serialize_chunked(massive_df, chunk_size=10000)
for chunk in chunks:
    store_chunk(chunk)  # Bounded memory usage

🔧 Implementation Goals

  • Performance profiling tools - detailed bottleneck identification
  • Memory streaming optimization - handle objects larger than RAM
  • Chunked processing - conditional feature based on additional user validation
  • Zero new dependencies - use stdlib profiling tools

📈 Success Metrics

  • Built-in performance monitoring with zero dependencies
  • 50%+ performance improvement for large ML objects
  • Handle 10GB+ objects with <2GB RAM usage
  • Maintain <2x stdlib overhead for simple JSON

🚨 Critical Changes from Original Roadmap

MAJOR PROGRESS ACHIEVED

  1. ✅ COMPLETED v0.7.0 Critical Fixes (was urgent priority)
  2. Result: All 9 originally failing tests now pass
  3. Impact: Core deserialization functionality now reliable
  4. Security: Complete protection against attack vectors

  5. 📊 Comprehensive Gap Analysis Complete

  6. Result: Deserialization audit identifies specific 32.4% gaps
  7. Impact: Clear roadmap for remaining round-trip work
  8. Priority: Focus on ML type metadata and verification systems

UPDATED PRIORITIES Based on v0.7.0 Success

  1. ML Type Round-Trips Moved to v0.7.5 (immediate priority)
  2. Rationale: Audit identified specific gaps in PyTorch/sklearn/pandas
  3. Impact: 22 failing test cases provide clear implementation targets
  4. Risk: Low - extends working deserialization engine

  5. Performance Monitoring Maintained at v0.9.0

  6. Rationale: Round-trip completion must come first
  7. Impact: Solid foundation needed before optimization
  8. Risk: Low - existing performance already excellent

SUCCESS VALIDATED

CORE FUNCTIONALITY: All originally failing tests now pass ✅ SECURITY: 28/28 security tests passing with hardened limits ✅ PERFORMANCE: 79% test coverage, 2.1M+ elements/second processing ✅ STABILITY: 1060 tests passing consistently with cache management

REMAINING WORK CLEAR

ML Round-Trips: 22 specific test cases failing (32.4% gap) ❌ PyTorch/sklearn: Need enhanced metadata deserialization ❌ Pandas DataFrames: Inconsistent metadata reconstruction ❌ NumPy Arrays: Need shape/dtype preservation


🎯 Success Metrics (Updated for v0.7.0)

Technical Excellence (Current)

  • Core Round-Trip Fidelity: ✅ 95%+ for basic types, 86.7% for complex types
  • Performance: ✅ <2x stdlib JSON for simple types, 2.1M+ elements/second
  • Reliability: ✅ All originally failing tests fixed, 1060 tests passing
  • Quality: ✅ 79% test coverage with comprehensive round-trip testing
  • Security: ✅ Complete protection against all known attack vectors

Targets for v0.8.0

  • Round-Trip Fidelity: 99.9%+ accuracy for all supported ML objects
  • ML Type Support: 100% of PyTorch/sklearn/pandas objects with metadata
  • Performance: Maintain current excellent performance
  • Migration: 95%+ successful conversions from pickle/other formats

Community Impact (Projected)

  • v0.8.0: 10,000+ monthly active users (complete round-trip support)
  • v0.9.0: Standard tool in 5+ major ML frameworks' documentation
  • v1.0: 100,000+ downloads, referenced in production ML tutorials

🔍 Validation from v0.7.0 Success

This updated roadmap reflects the major progress achieved:

FOUNDATION SOLID: Core deserialization engine working reliably ✅ GAPS IDENTIFIED: Clear 32.4% remaining work with specific test cases ✅ SECURITY HARDENED: Complete protection validated with 28 security tests ✅ PERFORMANCE PROVEN: 2.1M+ elements/second with 79% test coverage

Key Insight: The comprehensive deserialization audit provides a clear roadmap for the remaining 22 failing test cases, making v0.8.0 completion highly achievable.


Roadmap Principles: Perfect bidirectional ML serialization, stay focused, stay fast, stay simple

Updated: December 2024 based on v0.7.0 deserialization improvements and comprehensive audit
Next review: Q1 2025 after v0.8.0 round-trip completion