datason Product Roadmap¶
Mission: Make ML/data workflows reliably portable, readable, and structurally type-safe using human-friendly JSON.
🎯 Core Principles (Non-Negotiable)¶
✅ Minimal Dependencies¶
- Zero required dependencies for core functionality
- Optional dependencies only for specific integrations (pandas, torch, etc.)
- Never add dependencies that duplicate Python stdlib functionality
✅ Performance First¶
- Maintain <3x stdlib JSON overhead for simple types
- Benchmark-driven development with regression prevention
- Memory efficiency through configurable limits and smart defaults
✅ Comprehensive Test Coverage¶
- Maintain >90% test coverage across all features
- Test all edge cases and failure modes
- Performance regression testing for every release
🎯 Current State (v0.1.4)¶
✅ Foundation Complete¶
- Core Serialization: 20+ data types, circular reference detection, security limits
- Configuration System: 4 preset configs + 13+ configurable options
- Advanced Type Handling: Complex numbers, decimals, UUIDs, paths, enums, collections
- ML/AI Integration: PyTorch, TensorFlow, scikit-learn, NumPy, JAX, PIL
- Pandas Deep Integration: 6 DataFrame orientations, Series, Categorical, NaN handling
- Performance Optimizations: Early detection, memory streaming, configurable limits
- Comprehensive Testing: 83% coverage, 300+ tests, benchmark suite
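For orientation, a minimal sketch of what the current API looks like in practice, using `serialize` and the `get_ml_config` preset referenced in the examples later in this roadmap (the sample payload is illustrative):

```python
# Illustrative use of the existing v0.1.x API
import uuid
from datetime import datetime

import datason
from datason.config import get_ml_config  # one of the preset configs

payload = {
    "run_id": uuid.uuid4(),        # UUIDs handled out of the box
    "started_at": datetime.now(),  # datetimes serialized to ISO strings
    "metrics": {"accuracy": 0.93, "loss": 0.21},
}

# Plain call with defaults, or with the ML-oriented preset
as_json_friendly = datason.serialize(payload)
as_ml_json = datason.serialize(payload, config=get_ml_config())
```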
📊 Performance Baseline¶
- Simple JSON: 1.6x overhead vs stdlib json (a modest cost for the added functionality)
- Complex types: the only pure-JSON option for UUIDs, datetimes, and ML objects
- Advanced configs: 15-40% performance improvement over default
🚀 Focused Roadmap¶
Philosophy: Deepen what datason uniquely does well rather than expanding scope
v0.3.0 - Pickle Bridge¶
"Convert legacy ML pickle files to portable JSON - solve a real workflow pain point"
🎯 Unique Value Proposition¶
No other JSON serializer handles the ML community's massive pickle legacy. This bridges the gap safely.
```python
# Convert pickle to datason JSON (unique capability)
import datason

# Safe conversion with class whitelisting
json_data = datason.from_pickle(
    "model.pkl",
    safe_classes=["sklearn", "numpy", "pandas"],
)

# Bulk migration tools for ML teams
datason.convert_pickle_directory("old_models/", "json_models/")
```
🔧 Implementation Goals¶
- Zero new dependencies - use the stdlib pickle module only
- Security-first - whitelist approach for safe class loading (see the sketch after this list)
- Leverage existing type handlers - reuse 100% of current ML object support
- Maintain performance - streaming conversion for large pickle files
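A minimal sketch of the whitelist approach using only the stdlib pickle module; `WhitelistUnpickler`, `SAFE_PREFIXES`, and `load_pickle_safely` are illustrative names, not the final datason API:

```python
# Sketch: restrict which classes a pickle file may load, stdlib only.
import pickle

SAFE_PREFIXES = ("sklearn.", "numpy.", "pandas.", "builtins.")

class WhitelistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        fqname = f"{module}.{name}"
        if not fqname.startswith(SAFE_PREFIXES):
            raise pickle.UnpicklingError(f"Blocked class during unpickling: {fqname}")
        return super().find_class(module, name)

def load_pickle_safely(path: str):
    """Load a pickle file, refusing any class outside the whitelist."""
    with open(path, "rb") as fh:
        return WhitelistUnpickler(fh).load()
```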
📈 Success Metrics¶
- Support 95%+ of sklearn/torch/pandas pickle files
- Zero security vulnerabilities in pickle processing
- <5% performance overhead vs direct pickle loading
- No new dependencies added
v0.3.5 - Advanced ML Types¶
"Handle more ML framework objects that competitors can't serialize"
🎯 Unique Value Proposition¶
Extend datason's unique strength - serializing ML objects that other JSON libraries simply can't handle.
```python
# New ML object support (competitors can't do this)
import dask.dataframe as dd
import numpy as np
import torchvision
import xarray as xr
from transformers import AutoTokenizer

import datason
from datason.config import get_ml_config

# large_df is assumed to be an existing pandas DataFrame
data = {
    "xarray_dataset": xr.Dataset({"temp": (["x", "y"], np.random.random((3, 4)))}),
    "dask_dataframe": dd.from_pandas(large_df, npartitions=4),
    "pytorch_dataset": torchvision.datasets.MNIST(root="./data"),
    "huggingface_tokenizer": AutoTokenizer.from_pretrained("bert-base-uncased"),
}

# Only datason can serialize this to readable JSON
result = datason.serialize(data, config=get_ml_config())
```
🔧 Implementation Goals¶
- Extend type handler system - no architectural changes needed
- Optional dependencies only - graceful fallbacks when libs unavailable
- Consistent with existing patterns - reuse configuration system
- Maintain performance - efficient handling of large scientific objects
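For illustration, the optional-dependency pattern could look roughly like this (a sketch, not datason's internal handler code; the `__type__` key and function name are placeholders):

```python
# Sketch of the optional-dependency pattern: the handler activates only when a
# library is importable, and serialization falls through gracefully otherwise.
try:
    import xarray as xr
except ImportError:  # xarray not installed; skip the handler entirely
    xr = None

def maybe_serialize_xarray(obj):
    """Return a JSON-friendly dict for xarray objects, or None to fall through."""
    if xr is None or not isinstance(obj, (xr.Dataset, xr.DataArray)):
        return None
    return {
        "__type__": "xarray",
        "data": obj.to_dict(),  # xarray's own nested-dict representation
    }
```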
📈 Success Metrics¶
- Support 10+ additional ML/scientific libraries
- Zero new required dependencies
- Maintain existing performance characteristics
- 100% backward compatibility
v0.4.0 - Performance & Memory Optimization¶
"Make datason the fastest option for ML object serialization"
🎯 Unique Value Proposition¶
Other JSON libraries either can't handle ML objects or are slow. Make datason both capable AND fast.
```python
# Optimized for large ML workflows
import datason

# Memory-efficient streaming for large objects
with datason.stream_serialize("large_experiment.json") as stream:
    stream.write({"model": huge_model})
    stream.write({"data": massive_dataset})
    # Memory usage stays bounded

# Parallel serialization for multi-object workflows
results = datason.serialize_parallel([model1, model2, model3])
```
🔧 Implementation Goals¶
- Zero new dependencies - optimize existing algorithms
- Memory streaming - handle objects larger than RAM
- Parallel processing - utilize multiple cores efficiently
- Profile-driven optimization - target real-world ML bottlenecks
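`stream_serialize` and `serialize_parallel` above are proposals; a rough sketch of the bounded-memory idea using today's `serialize` plus JSON Lines (the helper name is hypothetical):

```python
# Sketch: bounded-memory "streaming" with today's API, writing one JSON
# record per line instead of holding the whole payload in memory.
import json

import datason

def write_records_jsonl(path, records):
    """Serialize records one at a time so memory stays roughly per-record."""
    with open(path, "w") as fh:
        for record in records:
            fh.write(json.dumps(datason.serialize(record)))
            fh.write("\n")
```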
📈 Success Metrics¶
- 50%+ performance improvement for large ML objects
- Handle 10GB+ objects with <2GB RAM usage
- Maintain <2x stdlib overhead for simple JSON
- Zero new dependencies
v0.4.5 - Typed Deserialization & Round-Trip Support¶
"Complete the portability story with safe data reconstruction"
🎯 Unique Value Proposition¶
Enable truly portable ML workflows by safely reconstructing Python objects from datason JSON.
```python
# Template-based deserialization with type safety
from datetime import datetime

import numpy as np
import datason

# Infer template from existing objects
template = datason.infer_template({"features": np.array([[1, 2, 3]]), "timestamp": datetime.now()})
# Result: {"features": Array(dtype="float64", shape=(-1, 3)), "timestamp": DateTime()}

# Type-safe reconstruction
json_data = '{"features": [[1.0, 2.0, 3.0]], "timestamp": "2024-01-15T10:30:00"}'
reconstructed = datason.cast_to_template(json_data, template)
assert reconstructed["features"].dtype == np.float64  # Type guaranteed
```
🔧 Implementation Goals¶
- Leverage existing type handlers - reuse serialization logic in reverse
- Zero new dependencies - use stdlib and existing optional deps
- Template inference - automatically generate cast templates from examples
- Type safety - prevent runtime type errors in ML pipelines
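`infer_template` and `cast_to_template` are proposed APIs; a toy, standalone sketch of the casting idea (the function and template keys are illustrative and only cover two cases):

```python
# Toy sketch of template-driven casting, independent of the proposed API.
import json
from datetime import datetime

import numpy as np

def cast_with_template(json_text, template):
    """Cast parsed JSON values to the kinds named in a {field: kind} template."""
    parsed = json.loads(json_text)
    out = {}
    for key, kind in template.items():
        value = parsed[key]
        if kind == "float64_array":
            out[key] = np.asarray(value, dtype=np.float64)
        elif kind == "datetime":
            out[key] = datetime.fromisoformat(value)
        else:
            out[key] = value
    return out

sample = '{"features": [[1.0, 2.0, 3.0]], "timestamp": "2024-01-15T10:30:00"}'
result = cast_with_template(sample, {"features": "float64_array", "timestamp": "datetime"})
assert result["features"].dtype == np.float64
```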
📈 Success Metrics¶
- 99%+ fidelity for numpy array round-trips (dtype, shape, values)
- Support for 15+ cast types (arrays, datetime, ML objects)
- <20% overhead vs naive JSON parsing
- Zero runtime type errors with proper templates
v0.5.0 - Configuration Refinement¶
"Perfect the configuration system based on real-world usage"
🎯 Unique Value Proposition¶
No other JSON serializer offers ML-specific configuration presets. Refine this unique advantage.
```python
# Enhanced presets based on user feedback
import datason
from datason.config import (
    get_inference_config,
    get_logging_config,
    get_research_config,
    get_training_config,
)

# New specialized configurations
inference_config = get_inference_config()  # Optimized for model serving
research_config = get_research_config()    # Preserve maximum information
logging_config = get_logging_config()      # Safe for production logs
training_config = get_training_config()    # Balance speed and fidelity

# Environment-based auto-configuration
datason.auto_configure()  # Detects ML environment and optimizes
```
🔧 Implementation Goals¶
- Refine existing system - no architectural changes
- User feedback driven - address real pain points
- Maintain simplicity - keep API surface small
- Performance tuning - optimize configuration combinations
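One possible shape for the environment detection behind `auto_configure` (a hypothetical helper using only the stdlib; the real heuristics and labels may differ):

```python
# Sketch: detect which ML stack appears to be installed, stdlib only.
from importlib.util import find_spec

def detect_ml_environment():
    """Return a coarse label for the installed ML stack."""
    if find_spec("torch") is not None or find_spec("tensorflow") is not None:
        return "deep_learning"
    if find_spec("sklearn") is not None:
        return "classic_ml"
    return "plain_python"
```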
📈 Success Metrics¶
- 90%+ user satisfaction with default configurations
- 25%+ performance improvement for common workflows
- Zero breaking changes to existing API
- Maintain zero required dependencies
v0.5.5 - Production Safety & Redaction¶
"Make datason safe for production ML logging and compliance"
🎯 Unique Value Proposition¶
ML workflows often contain sensitive data. Provide built-in redaction without breaking serialization fidelity.
```python
# Safe logging configuration for production ML
import datason
from datason.config import SerializationConfig

# Field-level redaction with smart patterns
config = SerializationConfig(
    redact_fields=["password", "api_key", "*.secret", "user.email"],
    redact_large_objects=True,  # Auto-redact >10MB objects
    redact_patterns=[r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"],  # Credit cards
    redaction_replacement="<REDACTED>",
    include_redaction_summary=True,
)

# Safe for production logs
result = datason.serialize(ml_experiment_data, config=config)
```
🔧 Implementation Goals¶
- Extend configuration system - no new dependencies
- Pattern-based redaction - regex and field path matching
- Audit trails - track what was redacted and why
- Performance conscious - minimal overhead when not redacting
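A minimal sketch of the pattern-based redaction idea, applied to an already-serialized dict (the real feature would live behind `SerializationConfig`; names here are illustrative):

```python
# Sketch: recursively mask sensitive field names and credit-card-like strings.
import re

CARD_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")
SENSITIVE_FIELDS = {"password", "api_key"}

def redact(obj, replacement="<REDACTED>"):
    """Walk nested dicts/lists, replacing sensitive fields and matching strings."""
    if isinstance(obj, dict):
        return {
            key: replacement if key in SENSITIVE_FIELDS else redact(value, replacement)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item, replacement) for item in obj]
    if isinstance(obj, str):
        return CARD_PATTERN.sub(replacement, obj)
    return obj
```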
📈 Success Metrics¶
- 99.9%+ sensitive data detection for common patterns
- <5% false positive rate for redaction
- <10% performance overhead for redaction processing
- Zero new dependencies
v0.6.0 - Snapshot Testing & ML DevX¶
"Turn datason's readable JSON into powerful ML testing infrastructure"
🎯 Unique Value Proposition¶
Leverage datason's human-readable JSON to create the best ML testing experience available.
```python
# Snapshot testing for ML workflows
import datason

# Generate readable test snapshots
@datason.snapshot_test("test_model_prediction")
def test_model_output():
    model = load_trained_model()
    prediction = model.predict(test_data)
    # Auto-generates human-readable JSON snapshot
    datason.assert_snapshot(prediction, normalize_floats=True)

# Update snapshots when behavior intentionally changes
datason.update_snapshots("test_model_*", reason="Added new features")

# Compare model outputs semantically
datason.assert_equivalent(old_predictions, new_predictions,
                          tolerance=1e-6, ignore_fields=["timestamp"])
```
🔧 Implementation Goals¶
- Leverage existing serialization - build on proven JSON generation
- Git-friendly diffs - human-readable changes in model outputs
- ML-specific normalization - handle float precision, timestamps, etc.
- Integration friendly - work with pytest, unittest, CI/CD
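A sketch of the ML-specific normalization step (a hypothetical helper; the shipped API would likely hide this behind `assert_snapshot`):

```python
# Sketch: round floats and drop volatile fields so snapshots stay diff-friendly.
def normalize_for_snapshot(obj, ndigits=6, ignore_fields=("timestamp",)):
    """Make serialized output stable across runs before comparing snapshots."""
    if isinstance(obj, dict):
        return {
            key: normalize_for_snapshot(value, ndigits, ignore_fields)
            for key, value in obj.items()
            if key not in ignore_fields
        }
    if isinstance(obj, list):
        return [normalize_for_snapshot(item, ndigits, ignore_fields) for item in obj]
    if isinstance(obj, float):
        return round(obj, ndigits)
    return obj
```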
📈 Success Metrics¶
- 50%+ reduction in ML test maintenance overhead
- Support for 95%+ of ML output types
- <10s snapshot update time for large test suites
- Zero false positive failures from irrelevant changes
v0.6.5 - Test Infrastructure & Quality¶
"Achieve industry-leading reliability for production ML workflows"
🎯 Unique Value Proposition¶
ML workflows need bulletproof reliability. Make datason the most tested serialization library.
```python
# Enhanced testing and validation tools
import datason

# Built-in validation for ML workflows
result = datason.serialize(model, validate=True)  # Catches issues early

# Property-based testing for edge cases
datason.test_roundtrip(my_custom_object)  # Ensures perfect fidelity

# Performance monitoring
with datason.monitor() as m:
    result = datason.serialize(large_data)
print(f"Serialized {m.size_mb}MB in {m.time_ms}ms")
```
🔧 Implementation Goals¶
- Expand test coverage to 95%+ - test all edge cases
- Property-based testing - use hypothesis for edge case discovery
- Performance regression tests - prevent performance degradation
- ML-specific test utilities - help users test their integrations
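A sketch of what a property-based check could look like, assuming hypothesis as a dev-only dependency and today's `serialize`:

```python
# Sketch: whatever nested structure hypothesis generates, datason.serialize
# must produce something the stdlib json module can encode.
import json

from hypothesis import given, strategies as st

import datason

json_like = st.recursive(
    st.none() | st.booleans() | st.floats(allow_nan=False, allow_infinity=False) | st.text(),
    lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    max_leaves=20,
)

@given(json_like)
def test_serialize_output_is_json_encodable(value):
    json.dumps(datason.serialize(value))  # must not raise
```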
📈 Success Metrics¶
- 95%+ test coverage across all modules
- Zero critical bugs in production deployments
- Comprehensive property-based test suite
- Built-in performance monitoring tools
v0.7.0 - Delta Serialization & Efficiency¶
"Optimize storage and improve traceability for evolving ML objects"
🎯 Unique Value Proposition¶
Make ML experiment tracking and model versioning storage-efficient with structural diffs.
```python
# Delta-aware serialization for efficient storage
import datason

# Only serialize what changed
baseline_model = load_model("v1.0")
updated_model = train_incremental(baseline_model, new_data)
delta = datason.serialize_delta(updated_model, baseline=baseline_model)
# Result: {"changed_params": {"layer_3.weight": [...]}, "metadata": {...}}

# Reconstruct from baseline + delta
reconstructed = datason.apply_delta(baseline_model, delta)

# Great for Git-friendly experiment tracking
datason.save_experiment_delta("experiment_v2.json", new_experiment,
                              baseline="experiment_v1.json")
```
🔧 Implementation Goals¶
- Structural diff algorithms - efficient comparison of deep objects
- Configurable sensitivity - control what constitutes a "change"
- Storage optimization - significant space savings for incremental updates
- Git integration - meaningful diffs for version control
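A toy sketch of the structural diff idea over nested dicts (`serialize_delta`/`apply_delta` remain proposed APIs; deletions are ignored here for brevity):

```python
# Sketch: return only the keys whose values changed or were added.
def dict_delta(baseline, updated):
    """Recursively compare two nested dicts and keep the differing leaves."""
    delta = {}
    for key, new_value in updated.items():
        old_value = baseline.get(key)
        if isinstance(old_value, dict) and isinstance(new_value, dict):
            nested = dict_delta(old_value, new_value)
            if nested:
                delta[key] = nested
        elif new_value != old_value:
            delta[key] = new_value
    return delta
```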
📈 Success Metrics¶
- 80%+ storage reduction for incremental model updates
- <100ms delta computation for typical ML models
- Support delta chains of 50+ steps without degradation
- Human-readable diffs in version control
v0.8.0 - Documentation & Ecosystem¶
"Make datason the easiest ML serialization library to adopt"
🎯 Unique Value Proposition¶
Complex ML serialization made simple through excellent documentation and examples.
```python
# Comprehensive examples for every ML use case
import datason

# Interactive documentation with runnable examples
datason.examples.pytorch_model_serialization()
datason.examples.pandas_workflow()
datason.examples.sklearn_pipeline()

# Migration guides from other solutions
datason.migrate_from_pickle(existing_workflow)
datason.migrate_from_joblib(sklearn_artifacts)
```
🔧 Implementation Goals¶
- Comprehensive documentation - cover every use case
- Interactive examples - runnable code for all features
- Migration guides - help users switch from pickle/joblib
- Performance guides - help users optimize for their workloads
📈 Success Metrics¶
- 100% API documentation coverage
- Migration guides for 5+ popular alternatives
- Interactive examples for all major ML frameworks
- <5 minute time-to-first-success for new users
🚫 What We Won't Build¶
Schema Validation¶
- Why Not: Covered excellently by Pydantic, marshmallow, cerberus
- Instead: Focus on serializing whatever users already validate
Cloud Storage Integration¶
- Why Not: Adds dependencies, covered by cloud SDKs
- Instead: Focus on generating JSON that works with any storage
Cross-Format Support (Arrow, Protobuf)¶
- Why Not: Violates "human-friendly JSON" mission
- Instead: Perfect JSON serialization, let users convert if needed
Enterprise Features (Auth, Monitoring, etc.)¶
- Why Not: Adds complexity and dependencies far beyond core mission
- Instead: Integrate well with existing enterprise tools
Plugin System¶
- Why Not: Adds architectural complexity, can revisit post-v1.0
- Decision: Document for potential v1.0+ exploration, focus on core excellence first
🎯 Success Metrics¶
Technical Excellence¶
- Performance: Always <3x stdlib JSON for simple types
- Reliability: >99.9% uptime for critical ML workflows
- Coverage: Support 95%+ of common ML objects without dependencies
- Quality: 95%+ test coverage with zero critical production bugs
Adoption Goals¶
- v0.3.0: 5,000+ monthly active users in ML community
- v0.5.0: Standard tool in 3+ major ML frameworks' documentation
- v0.7.0: 50,000+ downloads, referenced in ML courses/tutorials
Community Impact¶
- Unique Value: Only JSON serializer that "just works" for ML objects
- Reliability: Teams trust datason for production ML pipelines
- Simplicity: Stays true to zero-dependency, minimal-complexity philosophy
🤝 Community & Feedback¶
Current Users¶
- Share your ML serialization pain points
- Report performance bottlenecks with real workloads
- Suggest ML frameworks we should prioritize
ML Framework Authors¶
- Partner on official integration examples
- Provide feedback on type handler implementations
- Help test edge cases in your frameworks
Enterprise Teams¶
- Share production use cases and requirements
- Provide feedback on configuration presets
- Help validate reliability improvements
Roadmap Principles: Stay focused, stay fast, stay simple, solve real problems
Last updated: June 2025 | Next review: Q3 2025