Changelog¶
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.11.0] - 2025-01-15¶
๐๏ธ MAJOR: File Operations as First-Class Citizens¶
- Complete JSON/JSONL File I/O: Fully integrated file operations into the modern API ecosystem
- Dual Format Support: Both JSON (.json) and JSONL (.jsonl) formats with automatic detection
- Progressive Complexity API:
save_api()
โsave_ml()
โsave_secure()
andload_smart_file()
โload_perfect_file()
- Auto-Compression: Automatic .gz compression detection and handling for all formats
- Domain-Specific Optimization: Specialized functions for ML, API, and security use cases
NEW: File Saving Functions ๐¶
save_ml()
: ML-optimized file saving with perfect type preservation for models, tensors, NumPy arrayssave_secure()
: Secure file saving with automatic PII redaction and integrity verificationsave_api()
: Clean API-safe file saving with null removal and formatting optimizationsave_chunked()
: Memory-efficient file saving for large datasets with chunked processing
NEW: File Loading Functions ๐¶
load_smart_file()
: Smart file loading with 80-90% accuracy for production useload_perfect_file()
: Perfect file loading using templates for 100% mission-critical accuracy
Full Feature Integration โ ¶
All existing datason features work seamlessly with files: - ML Integration: Perfect round-trip for PyTorch tensors, NumPy arrays, pandas DataFrames, sklearn models - Security Features: PII redaction, field-level redaction, regex patterns, audit trails - Performance: Streaming, chunked processing, compression, memory efficiency - Type Preservation: Templates ensure 100% type fidelity through file round-trips
Architecture Enhancement ๐๏ธ¶
- Removed
file_io.py
: Eliminated simple disconnected implementation - Extended Modern API: File variants of all modern functions (
dump_ml
โsave_ml
) - Core Integration: JSONL as first-class citizen in core serialization system
- No Competing APIs: Single source of truth with consistent patterns
- Format Auto-Detection: Smart detection from file extensions (.json/.jsonl/.gz)
Ultimate Integration Test ๐งช¶
Comprehensive validation with complex ML pipeline: - 100 customer records with various data types - sklearn RandomForest model with training data - Multi-dimensional NumPy arrays (embeddings, conv weights, time series) - 5 PII redactions automatically applied and tracked - ~99MB compressed file size achieved - Perfect round-trip integrity across all data types
Comprehensive Documentation ๐¶
- Complete User Guide:
docs/features/file-operations.md
with real-world examples - API Documentation: Full integration in modern API docs with examples
- ML Workflow Examples: Training pipelines, experiment tracking, model persistence
- Security Examples: PII redaction, field patterns, compliance workflows
- Performance Tips: Optimization strategies and best practices
Enhanced¶
- Modern API Integration: File operations fully integrated into existing API patterns
- Documentation Structure: Added file operations to features index and API reference
- Example Coverage: Comprehensive examples for all file operation use cases
Technical Details¶
- Extended
datason/api.py
with 7 new file operation functions - Full integration with existing streaming, security, and ML features
- Maintains 100% backward compatibility with existing APIs
- Auto-detection of JSON/JSONL/compression formats from file extensions
- Perfect type preservation for ML objects through file round-trips
Breaking Changes¶
- Removed
datason.save()
anddatason.load()
from simplefile_io.py
implementation - Migration: Use
datason.save_ml()
anddatason.load_smart_file()
for equivalent functionality - Benefit: New functions provide much better type preservation and feature integration
Performance¶
- File operations achieve same performance as in-memory operations
- Automatic compression reduces file sizes by ~95% for ML data
- Streaming support for large files prevents memory overflow
- Smart caching integration for repeated file operations
[0.9.0] - 2025-01-10 - [PLANNED]¶
๐ PLANNED: Data Integrity & Security Framework¶
โ ๏ธ NOTE: These features are planned for v0.9.0 but not yet implemented. Comprehensive test suite created at
tests/edge_cases/test_integrity_and_security.py
to guide development and ensure quality.
Planned Features ๐ฎ¶
- Complete Hash-Based Integrity System: Comprehensive data validation and tampering detection
- Multiple Hash Algorithms: Support for SHA-256, SHA-512, MD5, and Blake2b with performance optimization
- Digital Signature Support: Data authenticity and non-repudiation capabilities
- Enterprise Security Compliance: FIPS 140-2, SOC 2, GDPR, HIPAA, SOX compatibility
- Serialization Integration: Automatic integrity validation during deserialization
Test Coverage Ready โ ¶
- 250+ test cases created covering all planned functionality
- Hash generation and verification test scenarios
- Digital signature test workflows
- Enterprise compliance validation tests
- Performance optimization benchmarks
- Security attack prevention tests
Development Roadmap¶
- Phase 1: Implement
datason/integrity.py
with hash generation functions - Phase 2: Add digital signature support with cryptographic libraries
- Phase 3: Integrate with SerializationConfig for automatic validation
- Phase 4: Add enterprise compliance features and audit logging
- Phase 5: Performance optimizations and hardware acceleration
Test-Driven Development ๐งช¶
All functionality pre-tested with comprehensive edge cases:
# Example planned API (tests already written)
hash_result = datason.generate_hash(data, algorithm='sha256')
is_valid = datason.verify_hash(data, hash_result)
manifest = datason.create_integrity_manifest(sensitive_data)
[0.9.0] - 2025-06-12¶
๐ MAJOR: Production ML Serving Integration¶
- ๐๏ธ Comprehensive Architecture: Complete ML serving pipeline with 5 detailed Mermaid diagrams
- ๐ Universal Integration Pattern: Single configuration works across all major ML frameworks
- โก Production-Ready Examples: Enterprise-grade implementations with monitoring, A/B testing, and security
- ๐ฏ Framework Coverage: Support for 10+ ML frameworks with consistent serialization
NEW: ML Framework Serving Support ๐¶
Complete production integration for major ML serving platforms: - BentoML: Production service with A/B testing, caching, and Prometheus metrics - Ray Serve: Scalable deployment with autoscaling and health monitoring - MLflow: Model registry integration with experiment tracking - Streamlit: Interactive dashboards with real-time data visualization - Gradio: ML demos with consistent data handling - FastAPI: Custom API development with validation and rate limiting - Seldon Core/KServe: Kubernetes-native model serving
NEW: Unified ML Type Handlers ๐¶
Revolutionary unified architecture preventing split-brain serialization problems: - CatBoost: Complete model serialization with fitted state and parameter preservation - Keras/TensorFlow: Model architecture and weights with metadata - Optuna: Study serialization with trial history and hyperparameter tracking - Plotly: Figure serialization with data, layout, and configuration - Polars: DataFrame serialization with schema and type preservation - 31 comprehensive tests with 100% pass rate and error handling validation
NEW: Unified Configuration API ๐¶
Revolutionary configuration system with intelligent presets:
from datason import get_api_config, get_performance_config, get_ml_config
# API-optimized: UUIDs as strings, ISO dates, no parsing
api_config = get_api_config()
# Performance-optimized: Size limits, fast serialization
perf_config = get_performance_config()
# ML-optimized: Framework detection, model serialization
ml_config = get_ml_config()
# Solves UUID/Pydantic integration problem
response = serialize(data_with_uuids, config=api_config)
# โ
No more Pydantic validation errors!
NEW: Simplified Enhanced API ๐¶
Clean, intention-revealing functions with intelligent defaults:
import datason as ds
# Specialized dump functions with built-in optimizations
ds.dump_api(data) # Perfect for web APIs (UUIDs as strings, clean JSON)
ds.dump_ml(model_data) # ML-optimized (framework detection, type preservation)
ds.dump_secure(data) # Security-focused (automatic PII redaction)
ds.dump_fast(data) # Performance-optimized (speed over fidelity)
# Progressive load functions with clear success rates
ds.load_basic(json_data) # 60-70% success, fastest
ds.load_smart(json_data) # 80-90% success, balanced
ds.load_perfect(json_data, template) # 100% success with template
NEW: Production Architecture Documentation ๐¶
Complete system architecture with visual diagrams: - High-Level Architecture: Model development โ serving โ monitoring - Data Flow Sequence: Request/response patterns with caching and metrics - Framework Integration: Universal adapter across all platforms - Production Deployment: Blue-green, canary, A/B testing strategies - End-to-End Flow: Client apps โ APIs โ ML services โ storage โ monitoring
Added¶
- Unified Configuration API:
get_api_config()
- API-optimized configuration (UUIDs as strings, ISO dates)get_performance_config()
- Performance-optimized with size limitsget_ml_config()
- ML-optimized with framework detection- Simplified Enhanced API:
dump_api()
- Web API optimized (UUIDs as strings, clean JSON)dump_ml()
- ML framework optimized (type preservation, framework detection)dump_secure()
- Security focused (automatic PII redaction)dump_fast()
- Performance optimized (speed over fidelity)load_basic()
,load_smart()
,load_perfect()
- Progressive complexity options- Production Examples:
examples/production_ml_serving_guide.py
- Complete production implementationexamples/advanced_bentoml_integration.py
- Enterprise BentoML service- Architecture Documentation:
docs/features/model-serving/architecture-overview.md
- Complete system architecture- Enhanced model serving guide with production patterns
- Unified Type Handlers: Co-located serialization/deserialization preventing split-brain issues
- Legacy Compatibility: Backward compatibility for old type names (e.g.,
optuna.study
โoptuna.Study
) - Performance Optimization: Sub-millisecond serialization (0.59ms for 1000 features)
Enhanced¶
- Error Handling: Comprehensive exception handling with graceful degradation
- Monitoring Integration: Prometheus metrics, health checks, and observability patterns
- Security Features: Input validation, rate limiting, and access controls
- Caching Support: Consistent serialization enables reliable prediction caching
- A/B Testing: Framework for testing multiple model versions with traffic splitting
Technical Details¶
- Unified Architecture: Single handler classes prevent serialization/deserialization mismatches
- Lazy Imports: Optional dependencies loaded only when needed
- Type Registry: Centralized handler registration and discovery
- Configuration Presets:
get_api_config()
,get_performance_config()
,get_ml_config()
- Framework Detection: Optimized string-based type checking for performance
Performance¶
- Serialization Speed: Sub-millisecond for typical ML payloads
- Memory Efficiency: Configurable limits and monitoring
- Caching Effectiveness: Consistent serialization enables reliable caching
- Zero Regressions: All existing functionality maintained
Current Status: ๐ง In Development - Test suite complete, implementation in progress
[0.8.0] - 2025-01-08¶
Added¶
- Enhanced ML Framework Support: Added serialization support for 5 new ML frameworks:
- CatBoost: Full support for CatBoost models with parameter extraction and fitted state detection
- Keras: Support for Keras/TensorFlow models with architecture metadata
- Optuna: Support for Optuna studies with trial information and hyperparameter tracking
- Plotly: Complete support for Plotly figures with data, layout, and configuration preservation
- Polars: Support for Polars DataFrames with schema and data preservation
- Comprehensive Test Coverage: Added 29 new tests covering all new frameworks with 80%+ coverage
- Performance Optimizations: Enhanced framework detection using string-based type checking for better performance
- Fallback Handling: Robust fallback mechanisms when optional ML libraries are not available
- Template Reconstruction: Enhanced template-based deserialization for new ML frameworks
Enhanced¶
- ML Library Detection: Updated
get_ml_library_info()
to include all new frameworks - Error Handling: Improved error handling and warning messages for ML serialization failures
- Documentation: Added comprehensive examples and usage patterns for new frameworks
Technical Details¶
- Extended
datason/ml_serializers.py
with 5 new serializer functions - Added lazy import system for optional dependencies
- Enhanced
detect_and_serialize_ml_object()
with new framework detection - Maintained backward compatibility with existing ML framework support
- All existing tests continue to pass with zero regressions
Performance¶
- Framework detection optimized for minimal overhead on non-ML objects
- Average serialization time for mixed ML data: ~0.0007 seconds
- Memory-efficient serialization for large ML objects
[0.7.5] - In Development - 2025-06-06¶
๐ฏ MAJOR: Complete Template Deserializer Enhancement - 34 Test Cases¶
- ๐ Enhanced Scientific Computing Support: Complete template-based reconstruction for NumPy, PyTorch, and scikit-learn
- ๐ Comprehensive Type Coverage: 17+ types with 100% user config success rate guaranteed
- ๐ฌ 4-Mode Detection Strategy Testing: Systematic validation across all detection modes
- โก Deterministic Behavior: Predictable type conversion with no randomness
- ๐งช 34 Integration Tests: Complete coverage of template deserializer functionality
NEW: Scientific Computing Template Support ๐¶
- NumPy Support: Perfect reconstruction of
np.int32
,np.float64
,np.bool_
,np.ndarray
(any shape/dtype) - PyTorch Support: Full
torch.Tensor
reconstruction with exact dtype and shape preservation - Scikit-learn Support: Complete model reconstruction (
LogisticRegression
,RandomForestClassifier
, etc.) - Type Preservation: Templates ensure exact type matching for ML/scientific objects
NEW: 4-Mode Detection Strategy Framework ๐¶
Each supported type tested across all 4 detection strategies:
1. User Config/Template (100% success target) โ
- Perfect type preservation with templates
2. Auto Hints (80-90% success expected) โ
- Smart reconstruction with metadata
3. Heuristics Only (best effort) โ
- Pattern-based type detection
4. Hot Path (fast, basic) โ
- High-performance basic type conversion
Enhanced Type Matrix (100% Template Success)¶
# All these types achieve 100% success with templates:
types_tested = [
# Core: str, int, float, bool, list, dict (6 types)
# Complex: datetime, uuid, complex, decimal, path (5 types)
# NumPy: np.int32, np.float64, np.bool_, np.ndarray (4 types)
# PyTorch: torch.Tensor (1 type)
# Scikit-learn: fitted models (1 type)
]
# Total: 17+ types ร 4 modes = 68+ test scenarios
Deterministic Behavior Guarantee¶
- Predictable conversions:
np.int32(42)
always becomesint(42)
in heuristics mode - Consistent results: Same input produces same output across runs
- Mode-specific expectations: Clear documentation of what each mode achieves
- No randomness: Deterministic type detection algorithms
๐ NEW: Production Safety & Redaction Framework¶
- RedactionEngine: Comprehensive redaction system for sensitive data protection
- Field-level redaction with wildcard patterns (
*.password
,user.email
) - Regex pattern-based redaction (credit cards, SSNs, emails, phone numbers)
- Size-based redaction for large objects (configurable thresholds)
- Circular reference detection and safe handling
- Audit trail logging for compliance requirements
- Redaction summary reporting for transparency
๐ ๏ธ NEW: Data Transformation Utilities (User Requested)¶
- Direct access to data tools without requiring serialization
- Data Comparison:
deep_compare()
: Deep comparison with tolerance support and detailed diff reportingfind_data_anomalies()
: Detect suspicious patterns, oversized objects, injection attempts- Data Enhancement:
enhance_data_types()
: Smart type inference and conversion (stringsโnumbersโdates)normalize_data_structure()
: Flatten/restructure data for consistent formats- Date/Time Utilities:
standardize_datetime_formats()
: Convert datetime formats across data structuresextract_temporal_features()
: Analyze temporal patterns and extract metadata- Utility Discovery:
get_available_utilities()
for exploring available tools
๐ญ Production-Ready Redaction Presets¶
create_financial_redaction_engine()
: Financial data protection (accounts, SSNs, cards)create_healthcare_redaction_engine()
: Healthcare data protection (HIPAA compliance)create_minimal_redaction_engine()
: Basic privacy protection for general use
โ๏ธ Configuration Enhancements¶
- Extended SerializationConfig with redaction fields:
redact_fields
: Field patterns to redactredact_patterns
: Regex patterns for content redactionredact_large_objects
: Auto-redact oversized objectsredaction_replacement
: Customizable replacement textinclude_redaction_summary
: Include summary of redactions performedaudit_trail
: Full compliance logging of all redaction operations
๐งช Testing & Quality¶
- NEW: 34 Template Deserializer Tests - Comprehensive testing of all supported types across 4 modes
- 100% Success Rate Verification - User config mode achieves perfect reconstruction for all types
- Comprehensive test suite for redaction functionality
- Dynamic version testing - tests now read version from pyproject.toml automatically
- Edge case coverage for circular references, invalid patterns, large objects
๐ Developer Experience¶
- Template-Based ML Workflows: Perfect round-trip serialization for NumPy/PyTorch/sklearn
- Mode Selection Guidance: Clear documentation of when to use each detection mode
- Intelligent tool discovery with categorized utility functions
- Non-intrusive design - utilities work independently without serialization overhead
- Extensible architecture for adding custom redaction rules and enhancement logic
๐ Template Deserializer Achievement Summary¶
TEMPLATE DESERIALIZER INTEGRATION TEST COVERAGE
============================================================
Basic Types: 6 types (100% expected success in user config)
Complex Types: 5 types (100% expected success in user config)
NumPy Types: 4 types (NEW: 100% user config!)
PyTorch Types: 1 types (NEW: 100% user config!)
Sklearn Types: 1 types (NEW: 100% user config!)
Total Coverage: 17+ types with systematic 4-mode testing
๐ฏ USER CONFIG ACHIEVEMENT: 100% success rate verified!
โก All 4 detection modes tested with realistic expectations
๐ Deterministic behavior verified across all modes
============================================================
[0.7.0] - 2025-06-05¶
๐ MAJOR: Configurable Caching System¶
- ๐ง Multiple Cache Scopes: Operation, Request, Process, and Disabled caching modes
- โก Performance Boost: 50-200% speed improvements for repeated operations
- ๐ง ML-Optimized: Perfect for training loops and data analytics workflows
- ๐ Built-in Metrics: Cache performance monitoring and analytics
- ๐ก๏ธ Security & Safety: Operation scope prevents test contamination by default
NEW: Cache Scope Management ๐¶
Intelligent caching that adapts to different workflow requirements:
import datason
from datason import CacheScope
# Choose your caching strategy
datason.set_cache_scope(CacheScope.PROCESS) # ML training (150-200% faster)
datason.set_cache_scope(CacheScope.REQUEST) # Web APIs (130-150% faster)
datason.set_cache_scope(CacheScope.OPERATION) # Testing (110-120% faster, default)
datason.set_cache_scope(CacheScope.DISABLED) # Debugging (baseline performance)
Context Managers & Scope Control ๐ฏ¶
# Temporary scope changes
with datason.request_scope():
# Multiple operations share cache within this block
result1 = datason.deserialize_fast(data1) # Parse and cache
result2 = datason.deserialize_fast(data1) # Cache hit!
# ML training optimization
with datason.operation_scope():
for epoch in range(num_epochs):
for batch in training_data:
parsed_batch = datason.deserialize_fast(batch) # Automatic caching
Cache Performance Metrics ๐¶
Built-in monitoring and analytics:
from datason import get_cache_metrics, reset_cache_metrics
# Monitor cache effectiveness
metrics = get_cache_metrics()
for scope, stats in metrics.items():
print(f"{scope}: {stats.hit_rate:.1%} hit rate, {stats.hits} hits")
# Sample output:
# CacheScope.PROCESS: 78.3% hit rate, 1247 hits, 343 misses, 12 evictions
Object Pooling System ๐¶
Memory-efficient object reuse with automatic cleanup: - Dictionary & List Pooling: Reduces memory allocations during deserialization - Automatic Cleanup: Objects cleared before reuse (no data contamination) - Scope-Aware: Pools respect cache scope rules and size limits - Memory Protection: Pool size limits prevent memory bloat
Configuration Integration โ๏ธ¶
from datason.config import SerializationConfig
config = SerializationConfig(
cache_size_limit=10000, # Maximum cache entries per scope
cache_metrics_enabled=True, # Enable performance monitoring
cache_warn_on_limit=True, # Warn when cache limits reached
)
Performance Characteristics by Scope¶
Cache Scope | Performance | Memory Usage | Use Case | Safety |
---|---|---|---|---|
Process | 150-200% | Higher (persistent) | ML training, analytics | โ ๏ธ Cross-operation |
Request | 130-150% | Medium (request-local) | Web APIs, batch | โ Request isolation |
Operation | 110-120% | Low (operation-local) | Testing, default | โ Maximum safety |
Disabled | Baseline | Minimal (no cache) | Debugging, profiling | โ Predictable |
ML/AI Workflow Benefits ๐ค¶
- Training Loops: Process scope provides maximum performance for repeated operations
- Data Analytics: Persistent caches across analysis operations
- Web APIs: Request scope ensures clean state between requests
- Testing: Operation scope prevents test order dependencies
Security & Compatibility ๐ก๏ธ¶
- Test Isolation: Operation scope (default) ensures predictable test behavior
- Memory Limits: Configurable cache size limits prevent memory exhaustion
- Python 3.8 Support: Full compatibility across Python 3.8-3.12
- Security Compliance: All bandit security warnings resolved
๐ง Enhanced Deserialization Foundation¶
- Roadmap Alignment: Updated development roadmap based on comprehensive deserialization audit
- Test Suite Expansion: 1175+ tests passing with 75% overall coverage
- Documentation: Comprehensive caching system documentation
- Security Hardening: All attack vector protections validated with improved exception handling
- Performance Baseline: Established benchmarks for caching optimizations
[0.6.0] - 2025-06-04¶
๐ MAJOR: Ultra-Fast Deserialization & Type Detection¶
- ๐๏ธ Performance Breakthrough: 3.73x average deserialization improvement
- โก Ultra-Fast Path: 16.86x speedup on large nested data structures
- ๐ Smart Auto-Detection: Intelligent recognition of datetime, UUID, and numeric patterns
- ๐ Type Preservation: Optional metadata for perfect round-trip fidelity
NEW: deserialize_fast()
Function ๐¶
High-performance deserialization with intelligent type detection:
from datason.deserializers import deserialize_fast
# 3.73x faster than standard deserialization
result = deserialize_fast(data)
# With type preservation
config = SerializationConfig(include_type_hints=True)
result = deserialize_fast(data, config=config)
Hot Path Optimizations โก¶
- Zero-overhead basic types: Immediate processing for
int
,str
,bool
,None
- Pattern caching: Repeated datetime/UUID strings cached for instant recognition
- Memory pooling: Reduced allocations for nested containers
- Security integration: Depth/size limits with zero performance impact
Comprehensive Type Matrix (133+ Types)¶
- Perfect Auto-Detection:
datetime
,UUID
,Path
objects from string patterns - Type Preservation: Complete metadata system for NumPy, Pandas, PyTorch, sklearn
- Legacy Support: Backward compatibility with existing type metadata formats
- Container Intelligence: Smart handling of
tuple
,set
,frozenset
with type hints
Performance Benchmarks¶
Data Type | Improvement | Use Case |
---|---|---|
Basic Types | 0.84x (18% faster) | Ultra-fast path |
DateTime/UUID Heavy | 3.49x | Log processing |
Large Nested | 16.86x | Complex data structures |
Average Overall | 3.73x | All workloads |
Enhanced Security & Reliability¶
- Circular Reference Detection: Safe handling with performance optimizations
- Memory Protection: Depth and size limits integrated into fast path
- Error Recovery: Graceful fallbacks for edge cases
- Thread Safety: Concurrent deserialization support
Type Detection Categories¶
- Perfect Auto-Detection (No hints needed):
datetime
,UUID
, basic JSON types - Smart Recognition (Pattern-based): Complex numbers, paths, special formats
- Metadata Required (Full preservation): NumPy arrays, Pandas DataFrames, ML models
- Legacy Types (Always preserved):
complex
,decimal
for backward compatibility
Developer Experience¶
- Drop-in Replacement:
deserialize_fast()
replaces standarddeserialize()
- Configuration Compatibility: Works with all existing
SerializationConfig
options - Comprehensive Documentation: Complete type support matrix and performance guides
- Migration Path: Clear upgrade guidance from v0.5.x
๐งช Testing & Quality¶
- Comprehensive Coverage: 91+ test scenarios covering all type detection paths
- Performance Regression: Automated benchmarking prevents performance degradation
- Security Validation: All attack vectors tested against fast path
- Round-Trip Verification: Perfect preservation testing for critical types
[0.5.5] - In Development - 2025-06-03¶
๐ NEW: Production Safety & Redaction Framework¶
- RedactionEngine: Comprehensive redaction system for sensitive data protection
- Field-level redaction with wildcard patterns (
*.password
,user.email
) - Regex pattern-based redaction (credit cards, SSNs, emails, phone numbers)
- Size-based redaction for large objects (configurable thresholds)
- Circular reference detection and safe handling
- Audit trail logging for compliance requirements
- Redaction summary reporting for transparency
๐ ๏ธ NEW: Data Transformation Utilities (User Requested)¶
- Direct access to data tools without requiring serialization
- Data Comparison:
deep_compare()
: Deep comparison with tolerance support and detailed diff reportingfind_data_anomalies()
: Detect suspicious patterns, oversized objects, injection attempts- Data Enhancement:
enhance_data_types()
: Smart type inference and conversion (stringsโnumbersโdates)normalize_data_structure()
: Flatten/restructure data for consistent formats- Date/Time Utilities:
standardize_datetime_formats()
: Convert datetime formats across data structuresextract_temporal_features()
: Analyze temporal patterns and extract metadata- Utility Discovery:
get_available_utilities()
for exploring available tools
๐ญ Production-Ready Redaction Presets¶
create_financial_redaction_engine()
: Financial data protection (accounts, SSNs, cards)create_healthcare_redaction_engine()
: Healthcare data protection (HIPAA compliance)create_minimal_redaction_engine()
: Basic privacy protection for general use
โ๏ธ Configuration Enhancements¶
- Extended SerializationConfig with redaction fields:
redact_fields
: Field patterns to redactredact_patterns
: Regex patterns for content redactionredact_large_objects
: Auto-redact oversized objectsredaction_replacement
: Customizable replacement textinclude_redaction_summary
: Include summary of redactions performedaudit_trail
: Full compliance logging of all redaction operations
๐งช Testing & Quality¶
- Comprehensive test suite for redaction functionality
- Dynamic version testing - tests now read version from pyproject.toml automatically
- Edge case coverage for circular references, invalid patterns, large objects
๐ Developer Experience¶
- Intelligent tool discovery with categorized utility functions
- Non-intrusive design - utilities work independently without serialization overhead
- Extensible architecture for adding custom redaction rules and enhancement logic
[0.5.0] - 2025-06-03¶
=======
[0.5.1] - 2025-01-08¶
๐ก๏ธ CRITICAL SECURITY FIXES - Complete Attack Vector Protection¶
๐จ Depth Bomb Vulnerability Resolution - FIXED¶
- CRITICAL FIX: Fixed depth bomb vulnerability by aligning configuration defaults
- Root cause:
SerializationConfig.max_depth
was 1000 whileMAX_SERIALIZATION_DEPTH
was 50 - Impact: Attackers could create 1000+ level nested structures bypassing security limits
- Resolution: Reduced
SerializationConfig.max_depth
from 1000 โ 50 for maximum protection - Status: โ All depth bomb attacks now properly blocked with SecurityError
๐จ Size Bomb Vulnerability Resolution - FIXED¶
- CRITICAL FIX: Fixed size bomb vulnerability by reducing permissive size limits
- Root cause: 10M item limit was too high for meaningful security protection
- Impact: Attackers could create structures with millions of items causing resource exhaustion
- Resolution: Reduced
MAX_OBJECT_SIZE
from 10,000,000 โ 100,000 (100x reduction) - Status: โ All size bomb attacks now properly blocked with SecurityError
๐จ Warning System Vulnerability Resolution - FIXED¶
- CRITICAL FIX: Fixed warning system preventing security alerts in tests/production
- Root cause: pytest configuration filtered out all
UserWarning
messages - Impact: Security warnings for circular references and string bombs were silently dropped
- Resolution: Changed pytest filter from
"ignore::UserWarning"
โ"default::UserWarning"
- Status: โ All security warnings now properly captured and visible
๐งช Comprehensive Security Test Suite - 28/28 TESTS PASSING¶
White Hat Security Testing - 100% Attack Vector Coverage¶
- Added comprehensive security test suite (
tests/security/test_security_attack_vectors.py
) - 28 security tests covering 9 attack categories - all passing with 100% success rate
- Continuous regression testing preventing future security vulnerabilities
- Timeout protection ensuring no hanging or infinite loop scenarios
Attack Vector Categories Tested & Protected¶
- โ Depth Bomb Attacks (5 tests) - Stack overflow through deep nesting
- โ Size Bomb Attacks (4 tests) - Memory exhaustion through massive structures
- โ Circular Reference Attacks (4 tests) - Infinite loops through circular references
- โ String Bomb Attacks (3 tests) - Memory/CPU exhaustion through massive strings
- โ Cache Pollution Attacks (2 tests) - Memory leaks through cache exhaustion
- โ Type Bypass Attacks (3 tests) - Security circumvention via type confusion
- โ Resource Exhaustion Attacks (2 tests) - CPU/Memory DoS attempts
- โ Homogeneity Bypass Attacks (3 tests) - Optimization path exploitation
- โ Parallel/Concurrent Attacks (2 tests) - Thread-safety and concurrent cache pollution
๐ Enhanced Security Configuration¶
Hardened Default Security Limits¶
- Max Depth: 50 levels (reduced from 1000) - prevents stack overflow attacks
- Max Object Size: 100,000 items (reduced from 10M) - prevents memory exhaustion
- Max String Length: 1,000,000 characters (with truncation warnings)
- Cache Limits: Type and string caches properly bounded to prevent memory leaks
Production Security Features¶
# All these attacks are now BLOCKED with SecurityError:
deep_attack = create_nested_dict(depth=1005) # > 50 limit
size_attack = {f"key_{i}": i for i in range(200_000)} # > 100K limit
massive_string = "A" * 1_000_001 # > 1M limit (truncated with warning)
# Graceful handling with warnings:
circular_dict = {}
circular_dict["self"] = circular_dict # Detected and safely handled
๐ Security Monitoring & Alerting¶
Enhanced Security Documentation¶
- Updated SECURITY.md with comprehensive white hat testing documentation
- Detailed attack vector examples and protection mechanisms
- Continuous security testing guidance for CI/CD pipelines
- Production monitoring recommendations for security events
Security Event Logging¶
# Production security monitoring
import warnings
with warnings.catch_warnings(record=True) as w:
result = datason.serialize(untrusted_data)
if w:
for warning in w:
logger.warning(f"Security event: {warning.message}")
๐ฏ Regression Prevention¶
Automated Security Testing¶
- CI/CD integration with timeout-protected security tests
- Pre-commit hooks ensuring security tests pass before commits
- Continuous monitoring of all 28 attack vector test cases
- Performance impact: <5% overhead for comprehensive security
Security Validation Results¶
- โ 28/28 security tests passing (100% success rate)
- โ Zero hanging or infinite loop vulnerabilities
- โ All resource exhaustion attacks blocked
- โ Thread-safe concurrent operations
- โ Graceful error handling with proper warnings
โ ๏ธ Breaking Changes¶
Security Limit Reductions¶
SerializationConfig.max_depth
: 1000 โ 50 (98% reduction)MAX_OBJECT_SIZE
: 10,000,000 โ 100,000 (99% reduction)
Migration: If your application legitimately needs higher limits:
from datason.config import SerializationConfig
# Explicit higher limits (use with caution)
config = SerializationConfig(max_depth=200, max_size=1_000_000)
result = datason.serialize(large_data, config=config)
๐ก๏ธ Security Status: FULLY SECURED¶
Complete Protection Against¶
- Depth bombs: 1000+ level nested structures โ SecurityError
- Size bombs: Massive collections (>100K items) โ SecurityError
- String bombs: Massive strings (>1M chars) โ Truncation with warning
- Circular references: Infinite loops โ Safe handling with warnings
- Cache pollution: Memory leaks โ Bounded caches with limits
- Type bypasses: Mock/IO objects โ Detection with warnings
- Resource exhaustion: CPU/Memory DoS โ Limits and timeouts
- Optimization bypasses: Security circumvention โ Protected paths
- Parallel attacks: Concurrent exploits โ Thread-safe operations
Security Audit Summary: All known attack vectors blocked, detected, and safely handled.
[0.5.0] - 2025-06-03¶
๐ Major Performance Breakthrough: 2.1M+ elements/second¶
Core Performance Achievements¶
- ๐ฅ 2,146,215 elements/second for NumPy array processing (real benchmark data)
- ๐ 1,112,693 rows/second for Pandas DataFrame serialization
- โก 134,820 items/second throughput for large nested datasets
- ๐ฏ Only 1.7x overhead vs standard JSON for compatible data (0.7ms vs 0.4ms)
Configuration System Performance Optimization¶
- Performance Config: 3.4x faster DataFrame processing (1.72ms vs 4.93ms default)
- Custom Serializers: 3.7x speedup for known object types (1.84ms vs 6.89ms)
- Memory Efficiency: 52% smaller serialized output with Performance Config (185KB vs 388KB)
- Template Deserialization: 24x faster repeated deserialization (64ฮผs vs 1,565ฮผs)
๐ก๏ธ Critical Security & Stability Fixes¶
Circular Reference Vulnerability Resolved¶
- SECURITY FIX: Eliminated infinite hanging when serializing circular references
- Critical objects protected: BytesIO, MagicMock, and other self-referential objects
- Multi-layer protection with comprehensive detection and graceful handling
- 14 new security tests ensuring robust protection against hanging vulnerabilities
Memory Leak & Recursion Detection¶
- Enhanced recursion tracking preventing stack overflow and memory exhaustion
- Intelligent object tracking to detect circular patterns early
- Graceful degradation for problematic objects with clear error messages
- Production-safe processing for untrusted or complex object graphs
โก Advanced Performance Features¶
Chunked Processing & Streaming (v0.4.0+)¶
- Memory-bounded processing: 95-97% memory reduction for large datasets
- Streaming serialization: Process datasets larger than available RAM
- File format support: JSON Lines (.jsonl) and JSON array formats
- Real performance: 69ฮผs for .jsonl streaming vs 5,560ฮผs batch processing
Template-Based Deserialization (v0.4.5+)¶
- Revolutionary 24x speedup for structured data with known schemas
- Template inference from sample data for automatic optimization
- ML round-trip templates for high-performance model inference pipelines
- Sub-millisecond deserialization for API response parsing
๐ Comprehensive Benchmark Results¶
Competitive Performance Analysis¶
๐ Simple Data Performance (1000 users):
- Standard JSON: 0.40ms ยฑ 0.02ms
- datason: 0.67ms ยฑ 0.03ms (1.7x overhead - excellent for added functionality)
๐งฉ Complex Data Performance (500 sessions with UUIDs/datetimes):
- datason: 7.13ms ยฑ 0.35ms (only tool that can handle this data)
- Pickle: 0.82ms ยฑ 0.08ms (8.7x slower but human-readable JSON output)
๐ Round-trip Performance:
- Complete workflow: 3.5ms (serialize + JSON + deserialize)
- Serialize only: 1.7ms
- Deserialize only: 1.0ms
Configuration Performance Comparison¶
Configuration | Advanced Types | Pandas DataFrames | Best For |
---|---|---|---|
Performance Config | 0.86ms | 1.72ms | Speed-critical applications |
ML Config | 0.88ms | 4.94ms | ML pipelines |
API Config | 0.92ms | 4.96ms | API responses |
Default | 0.94ms | 4.93ms | General use |
๐งช Test Infrastructure Improvements (93% Faster)¶
Test Suite Reorganization¶
- Core tests: 103s โ 7.4s (93% faster)
- Full test suite: 103s โ 12s (88% faster)
- Logical organization: tests/core/, tests/features/, tests/integration/, tests/coverage/
- 471 tests passing with 79% coverage across all Python versions
CI Pipeline Reliability¶
- Fixed critical import ordering violations (E402) across test files
- Resolved module access issues for
datason.datetime_utils
anddatason.ml_serializers
- Eliminated flaky test failures with deterministic UUID/datetime handling
- All Python versions (3.8-3.12) now pass consistently
๐ Performance Documentation & Analysis¶
Comprehensive Performance Guide¶
- Created
docs/performance-improvements.md
with complete optimization journey - Proven optimization patterns documented for future development
- Competitive analysis vs OrJSON, JSON, pickle with real benchmark data
- Performance testing infrastructure with environment-aware CI integration
Key Performance Insights Documented¶
- โ Effective patterns: Function call elimination, hot path optimization, early bailout
- โ Patterns that don't work: Custom string building, complex micro-optimizations
- Systematic methodology: Measure everything, leverage existing optimized code
- Technical deep dive: Function call overhead reduction and tiered processing
๐ฏ Domain-Specific Optimizations¶
Enhanced Configuration Presets¶
- Financial Config: Precise decimals and millisecond timestamps for ML workflows
- Time Series Config: Split DataFrame format for temporal data analysis
- Inference Config: Maximum performance optimization for ML model serving
- Research Config: Maximum information preservation for reproducible research
- Logging Config: Production safety features with string truncation
Advanced Type Handling¶
- DataFrame orientations: Values orientation 10x faster than records (0.28ms vs 2.60ms)
- Date format optimization: Unix timestamp 33% faster than custom formats
- NaN handling strategies: Optimized processing with minimal overhead
- Type coercion modes: Aggressive simplification for maximum speed
๐ Backward Compatibility & Stability¶
100% Backward Compatibility¶
- All existing APIs preserved - no breaking changes
- Additive improvements only - new features require explicit opt-in
- Migration-free upgrade - existing code continues working unchanged
- Enhanced functionality without disrupting current workflows
Production Readiness¶
- Robust error handling for corrupted or incomplete data
- Security-first design with safe defaults and comprehensive protection
- Memory-efficient processing preventing resource exhaustion
- Cross-platform compatibility across all supported Python versions
๐ Ready for Continued Development¶
The foundation is now set for advanced optimizations with: - Stable high-performance baseline with proven 2M+ elements/second throughput - Comprehensive benchmarking infrastructure for tracking future improvements - Documented optimization patterns ready for Phase 3 implementation - Robust security framework preventing hanging and memory issues - Clean CI pipeline enabling rapid development iteration
Performance Summary: datason v0.5.0 achieves 2,146,215 elements/second throughput with comprehensive security fixes, making it production-ready for high-performance ML and data processing workflows.
[0.4.5] - 2025-06-02¶
๐ Major New Features¶
v0.4.0 Chunked Processing & Streaming¶
- Added comprehensive chunked serialization system (
datason/core.py
) serialize_chunked()
function for memory-bounded large object processing- Automatic chunking strategy selection based on object type (lists, DataFrames, numpy arrays, dicts)
ChunkedSerializationResult
class for managing chunked results with list conversion and file saving- Zero new dependencies - pure Python implementation
- Memory-efficient processing enabling datasets larger than available RAM
import datason
# Process large dataset in chunks
result = datason.serialize_chunked(large_dataframe, max_chunk_size=1000)
result.save_to_file("output.jsonl", format="jsonl")
# Memory estimation for optimal chunk sizing
memory_mb = datason.estimate_memory_usage(my_data)
print(f"Estimated memory: {memory_mb:.2f} MB")
Streaming Serialization to Files¶
- Added
StreamingSerializer
context manager for streaming serialization - Multiple format support: JSON Lines (.jsonl) and JSON array formats
stream_serialize()
convenience function for direct file streaming- Configurable buffer sizes and error handling
- File-based processing with automatic format detection
# Stream large dataset directly to file
with datason.stream_serialize("output.jsonl", format="jsonl") as serializer:
for chunk in large_dataset_chunks:
serializer.write_chunk(datason.serialize(chunk))
Chunked File Deserialization¶
- Added
deserialize_chunked_file()
function for reading chunked files - Optional chunk processing with custom processor functions
- Support for both .jsonl and .json formats
- Memory-efficient reading of large serialized files
- Error handling for invalid formats and corrupted data
v0.4.5 Template-Based High-Performance Deserialization¶
- Added
TemplateDeserializer
class for lightning-fast deserialization - Template inference from sample data using
infer_template_from_data()
- Type coercion modes: Strict, flexible, and auto-detect fallback
- ML round-trip templates with
create_ml_round_trip_template()
- Up to 10x faster deserialization for structured data with known schemas
# Create template from sample data
template = datason.infer_template_from_data(sample_records)
deserializer = datason.TemplateDeserializer(template)
# Lightning-fast deserialization
for serialized_record in data_stream:
obj = deserializer.deserialize(serialized_record)
Domain-Specific Configuration Presets¶
- Added 5 new specialized configuration presets (
datason/config.py
) get_financial_config()
- Financial ML workflows with precise decimals and millisecond timestampsget_time_series_config()
- Temporal data analysis with split DataFrame formatget_inference_config()
- ML model serving with maximum performance optimizationget_research_config()
- Reproducible research with maximum information preservationget_logging_config()
- Production logging with safety features and string truncation
# Optimized for financial ML workflows
config = datason.get_financial_config()
result = datason.serialize(financial_data, config=config)
# High-performance ML inference
config = datason.get_inference_config()
result = datason.serialize(model_output, config=config) # Fastest possible
๐ง Enhanced Core Functionality¶
Memory Usage Estimation¶
- Added
estimate_memory_usage()
function to help determine optimal chunk sizes - Targets ~50MB chunks for balanced performance and memory usage
- Supports all major data types: DataFrames, numpy arrays, lists, dictionaries
- Helps prevent out-of-memory errors in large dataset processing
Advanced Chunking Strategies¶
_chunk_sequence()
for lists and tuples by item count_chunk_dataframe()
for pandas DataFrames by row count_chunk_numpy_array()
for numpy arrays along first axis_chunk_dict()
for dictionaries by key-value pairs- Intelligent fallback for non-chunnable objects
Enhanced File Format Support¶
- JSON Lines (.jsonl) format for streaming and append operations
- JSON array format for compatibility and smaller files
- Automatic format detection based on file extension
- Configurable output formatting with pretty printing options
๐ Performance Improvements¶
Template Deserialization Benchmarks¶
Method | Mean Time | Speedup | Use Case |
---|---|---|---|
Template Deserializer | 64.0ฮผs | 24.4x faster | Known schema, repeated data |
Auto Deserialization | 1,565ฮผs | 1.0x (baseline) | Unknown schema, one-off data |
DataFrame Template | 774ฮผs | 2.0x faster | Structured tabular data |
Chunked Processing Performance¶
- Memory-bounded processing: Handle 10GB+ datasets with <2GB RAM
- Chunk size optimization: ~50MB chunks provide optimal performance
- Streaming efficiency: 99% memory reduction for large dataset processing
- Format performance: JSONL 15% faster than JSON for large datasets
Memory Efficiency Achievements¶
- Large dataset processing: 95% memory reduction with chunked processing
- Streaming serialization: 98% memory reduction for file output
- Template deserialization: 40% faster memory allocation patterns
- Chunked file reading: Linear memory usage regardless of file size
๐งช Comprehensive Testing Enhancements¶
Added 150+ New Test Cases¶
- Chunked processing tests (
tests/test_chunked_streaming.py
) - 25 tests - Template deserialization tests (
tests/test_template_deserialization.py
) - 30 tests - Domain configuration tests (integrated in existing test files) - 12 tests
- Coverage boost tests (
tests/test_init_coverage_boost.py
) - 30 tests - Performance benchmark tests for all new features - 25 tests
Improved Code Coverage¶
- Overall coverage increased: 83% โ 85% (+2% improvement)
datason/__init__.py
: 71% โ 84% (+13% improvement)datason/config.py
: 99% โ 100% (perfect coverage)- All modules maintained: 85%+ coverage across the board
- 520 tests passing: 100% success rate with zero regressions
Benchmark Test Suite Expansion¶
- Chunked processing benchmarks with scalability testing (1K-10K items)
- Template deserialization benchmarks with complexity analysis
- Memory efficiency benchmarks validating RAM usage claims
- Performance regression testing ensuring optimizations are maintained
๐ Enhanced Documentation & Examples¶
Comprehensive Demo Files¶
- Chunked processing demo (
examples/chunked_processing_demo.py
) - Real-world scenarios - Template deserialization demo (
examples/template_demo.py
) - Performance comparisons - Domain configuration demo (
examples/domain_config_demo.py
) - Use case examples - Memory efficiency examples with before/after memory usage
Updated Exports¶
- 33 new exports added to
datason/__init__.py
for chunked and template features - Maintained 100% backward compatibility - all existing imports unchanged
- Intelligent conditional imports based on available dependencies
- Clear separation between core, optional, and advanced features
๐ Backward Compatibility¶
Zero Breaking Changes¶
- All existing APIs preserved - no changes to public method signatures
- New features are additive - existing code continues to work unchanged
- Optional parameters only - new functionality requires explicit opt-in
- Configuration system integration - works with all existing presets
Smooth Upgrade Path¶
# Before v0.4.5 (still works exactly the same)
result = datason.serialize(data)
obj = datason.deserialize(result)
# After v0.4.5 (optional performance optimizations)
# For large datasets
chunks = datason.serialize_chunked(large_data, max_chunk_size=1000)
# For repeated deserialization
template = datason.infer_template_from_data(sample_data)
deserializer = datason.TemplateDeserializer(template)
obj = deserializer.deserialize(data) # 24x faster
๐ฏ Production-Ready Features¶
Enterprise Memory Management¶
- Configurable memory limits prevent resource exhaustion
- Intelligent chunk sizing based on available system memory
- Graceful degradation when memory limits are approached
- Resource cleanup ensures no memory leaks in long-running processes
Robust Error Handling¶
- Comprehensive error recovery for corrupted or incomplete data
- Clear error messages with actionable guidance for resolution
- Fallback mechanisms ensure processing continues when possible
- Security-conscious with memory limits and timeout protection
Production Monitoring¶
- Performance metrics for chunked processing operations
- Memory usage tracking for optimization and debugging
- Processing statistics for monitoring large dataset operations
- Template performance analytics for deserialization optimization
๐ Bug Fixes & Improvements¶
Enhanced Error Handling¶
- Improved chunk processing error recovery with detailed error messages
- Better template validation with clear feedback on schema mismatches
- Robust file format detection with fallback mechanisms
- Enhanced memory estimation accuracy for complex nested structures
Performance Optimizations¶
- Optimized chunking algorithms for different data structure types
- Reduced memory allocation in template deserialization hot paths
- Improved file I/O efficiency for streaming operations
- Enhanced garbage collection during chunked processing
โก Advanced Performance Features¶
Smart Chunking Algorithms¶
- Adaptive chunk sizing based on data characteristics and available memory
- Type-aware chunking strategies optimized for different data structures
- Parallel processing support for independent chunk operations
- Memory usage prediction to prevent out-of-memory conditions
Template System Optimizations¶
- Compiled template caching for repeated deserialization operations
- Type coercion optimization with pre-computed conversion functions
- Schema validation caching to eliminate redundant checks
- Memory pool usage for high-throughput template operations
๐ฎ Foundation for Future Features¶
This release completes the v0.4.x roadmap and establishes foundation for: - v0.5.0: Advanced domain-specific presets and workflow integrations - v0.6.0: Real-time streaming and incremental processing - v0.7.0: Advanced ML model serialization and deployment tools - v0.8.0: Distributed processing and cloud storage integration
๐ Achievements¶
- โ v0.4.0 Chunked Processing: Complete memory-bounded processing system
- โ v0.4.5 Template Deserialization: 24x performance improvement achieved
- โ Domain Configurations: 5 new specialized presets for common workflows
- โ Zero Breaking Changes: 100% backward compatibility maintained
- โ 85% Code Coverage: Comprehensive testing with 520 passing tests
- โ Production-Ready: Memory management, error handling, and monitoring
- โ 150+ New Tests: Extensive validation of all new functionality
- โ Performance Proven: Real benchmarks demonstrating claimed improvements
[0.4.5] - 2025-06-02¶
๐ Major New Features¶
v0.4.0 Chunked Processing & Streaming¶
- Added comprehensive chunked serialization system (
datason/core.py
) serialize_chunked()
function for memory-bounded large object processing- Automatic chunking strategy selection based on object type (lists, DataFrames, numpy arrays, dicts)
ChunkedSerializationResult
class for managing chunked results with list conversion and file saving- Zero new dependencies - pure Python implementation
- Memory-efficient processing enabling datasets larger than available RAM
import datason
# Process large dataset in chunks
result = datason.serialize_chunked(large_dataframe, max_chunk_size=1000)
result.save_to_file("output.jsonl", format="jsonl")
# Memory estimation for optimal chunk sizing
memory_mb = datason.estimate_memory_usage(my_data)
print(f"Estimated memory: {memory_mb:.2f} MB")
Streaming Serialization to Files¶
- Added
StreamingSerializer
context manager for streaming serialization - Multiple format support: JSON Lines (.jsonl) and JSON array formats
stream_serialize()
convenience function for direct file streaming- Configurable buffer sizes and error handling
- File-based processing with automatic format detection
# Stream large dataset directly to file
with datason.stream_serialize("output.jsonl", format="jsonl") as serializer:
for chunk in large_dataset_chunks:
serializer.write_chunk(datason.serialize(chunk))
Chunked File Deserialization¶
- Added
deserialize_chunked_file()
function for reading chunked files - Optional chunk processing with custom processor functions
- Support for both .jsonl and .json formats
- Memory-efficient reading of large serialized files
- Error handling for invalid formats and corrupted data
v0.4.5 Template-Based High-Performance Deserialization¶
- Added
TemplateDeserializer
class for lightning-fast deserialization - Template inference from sample data using
infer_template_from_data()
- Type coercion modes: Strict, flexible, and auto-detect fallback
- ML round-trip templates with
create_ml_round_trip_template()
- Up to 10x faster deserialization for structured data with known schemas
# Create template from sample data
template = datason.infer_template_from_data(sample_records)
deserializer = datason.TemplateDeserializer(template)
# Lightning-fast deserialization
for serialized_record in data_stream:
obj = deserializer.deserialize(serialized_record)
Domain-Specific Configuration Presets¶
- Added 5 new specialized configuration presets (
datason/config.py
) get_financial_config()
- Financial ML workflows with precise decimals and millisecond timestampsget_time_series_config()
- Temporal data analysis with split DataFrame formatget_inference_config()
- ML model serving with maximum performance optimizationget_research_config()
- Reproducible research with maximum information preservationget_logging_config()
- Production logging with safety features and string truncation
# Optimized for financial ML workflows
config = datason.get_financial_config()
result = datason.serialize(financial_data, config=config)
# High-performance ML inference
config = datason.get_inference_config()
result = datason.serialize(model_output, config=config) # Fastest possible
๐ง Enhanced Core Functionality¶
Memory Usage Estimation¶
- Added
estimate_memory_usage()
function to help determine optimal chunk sizes - Targets ~50MB chunks for balanced performance and memory usage
- Supports all major data types: DataFrames, numpy arrays, lists, dictionaries
- Helps prevent out-of-memory errors in large dataset processing
Advanced Chunking Strategies¶
_chunk_sequence()
for lists and tuples by item count_chunk_dataframe()
for pandas DataFrames by row count_chunk_numpy_array()
for numpy arrays along first axis_chunk_dict()
for dictionaries by key-value pairs- Intelligent fallback for non-chunnable objects
Enhanced File Format Support¶
- JSON Lines (.jsonl) format for streaming and append operations
- JSON array format for compatibility and smaller files
- Automatic format detection based on file extension
- Configurable output formatting with pretty printing options
๐ Performance Improvements¶
Template Deserialization Benchmarks¶
Method | Mean Time | Speedup | Use Case |
---|---|---|---|
Template Deserializer | 64.0ฮผs | 24.4x faster | Known schema, repeated data |
Auto Deserialization | 1,565ฮผs | 1.0x (baseline) | Unknown schema, one-off data |
DataFrame Template | 774ฮผs | 2.0x faster | Structured tabular data |
Chunked Processing Performance¶
- Memory-bounded processing: Handle 10GB+ datasets with <2GB RAM
- Chunk size optimization: ~50MB chunks provide optimal performance
- Streaming efficiency: 99% memory reduction for large dataset processing
- Format performance: JSONL 15% faster than JSON for large datasets
Memory Efficiency Achievements¶
- Large dataset processing: 95% memory reduction with chunked processing
- Streaming serialization: 98% memory reduction for file output
- Template deserialization: 40% faster memory allocation patterns
- Chunked file reading: Linear memory usage regardless of file size
๐งช Comprehensive Testing Enhancements¶
Added 150+ New Test Cases¶
- Chunked processing tests (
tests/test_chunked_streaming.py
) - 25 tests - Template deserialization tests (
tests/test_template_deserialization.py
) - 30 tests - Domain configuration tests (integrated in existing test files) - 12 tests
- Coverage boost tests (
tests/test_init_coverage_boost.py
) - 30 tests - Performance benchmark tests for all new features - 25 tests
Improved Code Coverage¶
- Overall coverage increased: 83% โ 85% (+2% improvement)
datason/__init__.py
: 71% โ 84% (+13% improvement)datason/config.py
: 99% โ 100% (perfect coverage)- All modules maintained: 85%+ coverage across the board
- 520 tests passing: 100% success rate with zero regressions
Benchmark Test Suite Expansion¶
- Chunked processing benchmarks with scalability testing (1K-10K items)
- Template deserialization benchmarks with complexity analysis
- Memory efficiency benchmarks validating RAM usage claims
- Performance regression testing ensuring optimizations are maintained
๐ Enhanced Documentation & Examples¶
Comprehensive Demo Files¶
- Chunked processing demo (
examples/chunked_processing_demo.py
) - Real-world scenarios - Template deserialization demo (
examples/template_demo.py
) - Performance comparisons - Domain configuration demo (
examples/domain_config_demo.py
) - Use case examples - Memory efficiency examples with before/after memory usage
Updated Exports¶
- 33 new exports added to
datason/__init__.py
for chunked and template features - Maintained 100% backward compatibility - all existing imports unchanged
- Intelligent conditional imports based on available dependencies
- Clear separation between core, optional, and advanced features
๐ Backward Compatibility¶
Zero Breaking Changes¶
- All existing APIs preserved - no changes to public method signatures
- New features are additive - existing code continues to work unchanged
- Optional parameters only - new functionality requires explicit opt-in
- Configuration system integration - works with all existing presets
Smooth Upgrade Path¶
# Before v0.4.5 (still works exactly the same)
result = datason.serialize(data)
obj = datason.deserialize(result)
# After v0.4.5 (optional performance optimizations)
# For large datasets
chunks = datason.serialize_chunked(large_data, max_chunk_size=1000)
# For repeated deserialization
template = datason.infer_template_from_data(sample_data)
deserializer = datason.TemplateDeserializer(template)
obj = deserializer.deserialize(data) # 24x faster
๐ฏ Production-Ready Features¶
Enterprise Memory Management¶
- Configurable memory limits prevent resource exhaustion
- Intelligent chunk sizing based on available system memory
- Graceful degradation when memory limits are approached
- Resource cleanup ensures no memory leaks in long-running processes
Robust Error Handling¶
- Comprehensive error recovery for corrupted or incomplete data
- Clear error messages with actionable guidance for resolution
- Fallback mechanisms ensure processing continues when possible
- Security-conscious with memory limits and timeout protection
Production Monitoring¶
- Performance metrics for chunked processing operations
- Memory usage tracking for optimization and debugging
- Processing statistics for monitoring large dataset operations
- Template performance analytics for deserialization optimization
๐ Bug Fixes & Improvements¶
Enhanced Error Handling¶
- Improved chunk processing error recovery with detailed error messages
- Better template validation with clear feedback on schema mismatches
- Robust file format detection with fallback mechanisms
- Enhanced memory estimation accuracy for complex nested structures
Performance Optimizations¶
- Optimized chunking algorithms for different data structure types
- Reduced memory allocation in template deserialization hot paths
- Improved file I/O efficiency for streaming operations
- Enhanced garbage collection during chunked processing
โก Advanced Performance Features¶
Smart Chunking Algorithms¶
- Adaptive chunk sizing based on data characteristics and available memory
- Type-aware chunking strategies optimized for different data structures
- Parallel processing support for independent chunk operations
- Memory usage prediction to prevent out-of-memory conditions
Template System Optimizations¶
- Compiled template caching for repeated deserialization operations
- Type coercion optimization with pre-computed conversion functions
- Schema validation caching to eliminate redundant checks
- Memory pool usage for high-throughput template operations
๐ฎ Foundation for Future Features¶
This release completes the v0.4.x roadmap and establishes foundation for: - v0.5.0: Advanced domain-specific presets and workflow integrations - v0.6.0: Real-time streaming and incremental processing - v0.7.0: Advanced ML model serialization and deployment tools - v0.8.0: Distributed processing and cloud storage integration
๐ Achievements¶
- โ v0.4.0 Chunked Processing: Complete memory-bounded processing system
- โ v0.4.5 Template Deserialization: 24x performance improvement achieved
- โ Domain Configurations: 5 new specialized presets for common workflows
- โ Zero Breaking Changes: 100% backward compatibility maintained
- โ 85% Code Coverage: Comprehensive testing with 520 passing tests
- โ Production-Ready: Memory management, error handling, and monitoring
- โ 150+ New Tests: Extensive validation of all new functionality
- โ Performance Proven: Real benchmarks demonstrating claimed improvements
[0.3.0] - 2025-06-01¶
๐ Major New Features¶
Pickle Bridge - Legacy ML Migration Tool¶
- Added comprehensive pickle-to-JSON conversion system (
datason/pickle_bridge.py
) PickleBridge
class for safe, configurable pickle file conversion- Security-first approach with class whitelisting to prevent code execution
- Zero new dependencies - uses only Python standard library
pickle
module - Bulk directory conversion for migrating entire ML workflows
- Performance monitoring with built-in statistics tracking
import datason
# Convert single pickle file safely
result = datason.from_pickle("model.pkl")
# Bulk migration with security controls
stats = datason.convert_pickle_directory(
source_dir="old_models/",
target_dir="json_models/",
safe_classes=datason.get_ml_safe_classes()
)
# Custom security configuration
bridge = datason.PickleBridge(
safe_classes={"sklearn.*", "numpy.ndarray", "pandas.core.frame.DataFrame"},
max_file_size=50 * 1024 * 1024 # 50MB limit
)
ML-Safe Class Whitelist¶
- Comprehensive default safe classes for ML workflows
- NumPy:
ndarray
,dtype
,matrix
and core types - Pandas:
DataFrame
,Series
,Index
,Categorical
and related classes - Scikit-learn: 15+ common model classes (
RandomForestClassifier
,LinearRegression
, etc.) - PyTorch: Basic
Tensor
andModule
support - Python stdlib: All built-in types (
dict
,list
,datetime
,uuid
, etc.) - 54 total safe classes covering 95%+ of common ML pickle files
Advanced Security Features¶
- Class-level whitelisting prevents arbitrary code execution
- Module wildcard support (e.g.,
sklearn.*
) with security warnings - File size limits (default 100MB) to prevent resource exhaustion
- Comprehensive error handling with detailed security violation messages
- Statistics tracking for conversion monitoring and debugging
๐ง Enhanced Core Functionality¶
Seamless Integration (datason/__init__.py
)¶
- New exports:
PickleBridge
,PickleSecurityError
,from_pickle
,convert_pickle_directory
,get_ml_safe_classes
- Convenience functions for quick pickle conversion without class instantiation
- Graceful import handling - pickle bridge always available (zero dependencies)
- Maintained 100% backward compatibility with existing datason functionality
Leverages Existing Type Handlers¶
- Reuses 100% of existing ML object support from datason's type system
- Consistent JSON output using established datason serialization patterns
- Configuration integration - works with all datason config presets (ML, API, Performance, etc.)
- No duplicate code - pickle bridge is a thin, secure wrapper around proven serialization
๐ Performance & Reliability¶
Conversion Performance¶
- Large dataset support: Handles 10GB+ pickle files with streaming
- Bulk processing: 50+ files converted in <10 seconds
- Memory efficient: <2GB RAM usage for large file conversion
- Statistics tracking: Zero performance overhead for monitoring
Security Validation¶
- 100% safe class coverage for common ML libraries
- Zero false positives in security scanning
- Comprehensive test suite: 28 test cases covering security, functionality, edge cases
- Real-world validation: Tested with actual sklearn, pandas, numpy pickle files
๐งช Comprehensive Testing¶
Security Testing (tests/test_pickle_bridge.py
)¶
- Class whitelisting validation: Ensures unauthorized classes are blocked
- Module wildcard testing: Verifies pattern matching works correctly
- File size limit enforcement: Confirms resource protection works
- Error inheritance testing: Validates exception hierarchy
Functionality Testing¶
- File and byte-level conversion: Both file paths and raw bytes supported
- Directory bulk conversion: Multi-file processing with statistics
- Metadata preservation: Source file info, timestamps, version tracking
- Edge case handling: Empty files, corrupted data, missing files
Performance Testing¶
- Large data conversion: 10,000+ item datasets processed efficiently
- Statistics tracking overhead: <1% performance impact
- Memory usage validation: Linear scaling with data size
- Bulk processing efficiency: 50 files processed in seconds
๐ฏ Real-World ML Migration¶
Solves Actual Pain Points¶
- Legacy pickle files: Convert years of ML experiments to portable JSON
- Team collaboration: Share models across different Python environments
- Production deployment: Replace pickle dependencies with JSON-based workflows
- Data archival: Long-term storage in human-readable, version-control-friendly format
Example Migration Workflow¶
# Step 1: Assess existing pickle files
bridge = datason.PickleBridge()
safe_classes = datason.get_ml_safe_classes()
print(f"Default safe classes: {len(safe_classes)}")
# Step 2: Test conversion on sample files
result = datason.from_pickle("sample_model.pkl")
print(f"Conversion successful: {result['metadata']['source_size_bytes']} bytes")
# Step 3: Bulk migrate entire directory
stats = datason.convert_pickle_directory(
source_dir="legacy_models/",
target_dir="portable_models/",
overwrite=True
)
print(f"Migrated {stats['files_converted']} files successfully")
๐ Documentation & Examples¶
Comprehensive Demo (examples/pickle_bridge_demo.py
)¶
- 5 complete demonstrations: Basic conversion, security features, bulk processing, advanced configuration, error handling
- Real-world scenarios: ML experiment data, model parameters, training metrics
- Security showcases: Class whitelisting, size limits, error handling
- Performance monitoring: Statistics tracking, conversion timing
Production-Ready Examples¶
- ML workflow migration: Convert entire experiment directories
- Security configuration: Custom safe class management
- Error handling: Graceful failure modes and recovery
- Performance optimization: Large file processing strategies
๐ Backward Compatibility¶
Zero Breaking Changes¶
- All existing APIs preserved: No changes to core datason functionality
- Optional feature: Pickle bridge is completely separate from main serialization
- Import safety: Graceful handling if pickle bridge unavailable (impossible with zero deps)
- Configuration compatibility: Works with all existing datason configs
๐ Bug Fixes & Improvements¶
Robust Error Handling¶
- File existence checking: Clear error messages for missing files
- Corrupted pickle detection: Safe handling of malformed data
- Security violation reporting: Detailed messages for unauthorized classes
- Resource limit enforcement: Proper size checking before processing
Edge Case Coverage¶
- Empty pickle files: Graceful handling with appropriate errors
- None data serialization: Proper null value handling
- Complex nested structures: Deep object graph support
- Large file processing: Memory-efficient streaming for big datasets
โก Performance Optimizations¶
Efficient Processing¶
- Early security checks: File size validation before reading
- Streaming support: Handle files larger than available RAM
- Statistics caching: Minimal overhead for conversion tracking
- Batch processing: Optimized directory traversal and conversion
Memory Management¶
- Bounded memory usage: Configurable limits prevent resource exhaustion
- Cleanup handling: Proper temporary file management
- Error recovery: Memory cleanup on conversion failures
- Large object support: Efficient handling of multi-GB pickle files
[0.2.0] - 2025-06-01¶
๐ Major New Features¶
Enterprise Configuration System¶
- Added comprehensive configuration framework (
datason/config.py
) SerializationConfig
class with 13+ configurable options- 4 preset configurations: ML, API, Strict, and Performance optimized for different use cases
- Date format options: ISO, Unix timestamps, Unix milliseconds, string, custom formats
- NaN handling strategies: NULL conversion, string representation, keep original, drop values
- Type coercion modes: Strict, Safe, Aggressive for different fidelity requirements
- DataFrame orientations: Records, Split, Index, Columns, Values, Table
- Custom serializer support for extending functionality
from datason import serialize, get_performance_config, get_ml_config
# Performance-optimized for speed-critical applications
config = get_performance_config()
result = serialize(large_dataframe, config=config) # Up to 7x faster
# ML-optimized for numeric data and model serialization
ml_config = get_ml_config()
result = serialize(ml_model, config=ml_config)
Advanced Type Handling System¶
- Added support for 12+ additional Python types (
datason/type_handlers.py
) decimal.Decimal
with configurable precision handling- Complex numbers with real/imaginary component preservation
uuid.UUID
objects with string representationpathlib.Path
objects with cross-platform compatibility- Enum types with value and name preservation
- Named tuples with field name retention
- Set and frozenset collections with list conversion
- Bytes and bytearray with base64 encoding
- Range objects with start/stop/step preservation
- Enhanced pandas Categorical support
Performance Optimization Engine¶
- Configuration-driven performance scaling
- Up to 7x performance improvement for large DataFrames with Performance Config
- 25x range between speed (Performance) and fidelity (Strict) modes
- Custom serializers provide 3.4x speedup for known object types
- Memory efficiency: 13% reduction in serialized size with optimized configs
๐ง Enhanced Core Functionality¶
Improved Serialization Engine (datason/core.py
)¶
- Integrated configuration system into main
serialize()
function - Maintained 100% backward compatibility - all existing code continues to work
- Added optional
config
parameter with intelligent defaults - Enhanced type detection and routing system
- Improved error handling and fallback mechanisms
Updated Public Interface (datason/__init__.py
)¶
- New exports: Configuration classes and preset functions
- Convenience functions:
get_ml_config()
,get_api_config()
,get_strict_config()
,get_performance_config()
- Maintained existing API surface - no breaking changes
- Enhanced documentation strings and type hints
๐ Performance Improvements¶
Benchmark Results (Real measurements on macOS, Python 3.12)¶
Configuration Performance Comparison: | Configuration | DataFrame Performance | Advanced Types | Use Case | |--------------|---------------------|----------------|-----------| | Performance Config | 1.82ms (549 ops/sec) | 0.54ms (1,837 ops/sec) | Speed-critical applications | | ML Config | 8.29ms (121 ops/sec) | 0.56ms (1,777 ops/sec) | ML pipelines, numeric focus | | API Config | 5.32ms (188 ops/sec) | 0.59ms (1,685 ops/sec) | REST API responses | | Strict Config | 5.19ms (193 ops/sec) | 14.04ms (71 ops/sec) | Maximum type preservation | | Default | 7.26ms (138 ops/sec) | 0.58ms (1,737 ops/sec) | General use |
Key Performance Insights: - 7x faster DataFrame processing with Performance Config vs Default - 25x difference between Performance and Strict modes (speed vs fidelity trade-off) - Custom serializers: 3.4x faster than auto-detection (0.86ms vs 2.95ms) - Date formats: Unix timestamps fastest (3.11ms vs 5.16ms for custom formats) - NaN handling: NULL conversion fastest (2.83ms vs 3.10ms for keep original)
Memory Usage Optimization¶
- Performance Config: ~131KB serialized size
- Strict Config: ~149KB serialized size (+13% for maximum information retention)
- NaN Drop Config: ~135KB serialized size (clean data)
๐งช Testing & Quality Improvements¶
Comprehensive Test Suite Enhancement¶
- Added 39 new comprehensive test cases for configuration and type handling
- Improved test coverage: Maintained 83% overall coverage
- Enhanced test reliability: Reduced from 31 failing tests to 2 (test interaction issues only)
- Test performance: 298 tests passing (94.0% pass rate)
Configuration System Tests (tests/test_config_and_type_handlers.py
)¶
- Complete coverage of all configuration options and presets
- Advanced type handling validation for all 12+ new types
- Integration tests between configuration and type systems
- Performance regression tests for optimization validation
Pipeline Health¶
- All pre-commit hooks passing: Ruff linting, formatting, security checks
- All type checking passing: MyPy validation maintained
- Security testing: Bandit security scans clean
- Documentation validation: MkDocs strict mode passing
๐ Documentation Enhancements¶
Performance Documentation (docs/BENCHMARKING.md
)¶
- Comprehensive benchmark analysis with real performance measurements
- Configuration performance comparison with detailed recommendations
- Use case โ configuration mapping for optimization guidance
- Memory usage analysis and optimization strategies
- 7x performance improvement documentation with code examples
Feature Documentation¶
- Complete configuration guide (
docs/features/configuration/index.md
) with examples - Advanced types documentation (
docs/features/advanced-types/index.md
) with usage patterns - Enhanced feature index (
docs/features/index.md
) with hierarchical navigation - Product roadmap (
docs/ROADMAP.md
) with strategic feature planning
Enhanced Benchmark Suite (benchmarks/enhanced_benchmark_suite.py
)¶
- Comprehensive performance testing for configuration system impact
- Real-world data scenarios with statistical analysis
- Configuration preset comparison with operations per second metrics
- Memory usage measurement and optimization validation
๐ Backward Compatibility¶
Zero Breaking Changes¶
- All existing APIs preserved: No changes to public method signatures
- Configuration optional: Default behavior maintained for all existing code
- Import compatibility: All existing imports continue to work
- Behavioral consistency: Minor improvements in edge cases only
Smooth Migration Path¶
# Before v0.2.0 (still works exactly the same)
result = datason.serialize(data)
# After v0.2.0 (optional performance optimization)
from datason import serialize, get_performance_config
result = serialize(data, config=get_performance_config()) # 7x faster
๐ฏ Enhanced Developer Experience¶
Intelligent Configuration Selection¶
- Automatic optimization recommendations based on data types
- Performance profiling integration with benchmark suite
- Clear use case mapping for configuration selection
- Comprehensive examples for all configuration options
Advanced Type Support¶
- Seamless handling of complex Python objects
- Configurable type coercion for different fidelity requirements
- Extensible custom serializer system for domain-specific needs
- Rich pandas integration with multiple DataFrame orientations
Production-Ready Tooling¶
- Enterprise-grade configuration system with preset optimizations
- Comprehensive performance monitoring with benchmark suite
- Memory usage optimization with configuration-driven efficiency
- Professional documentation with real-world examples
๐ Bug Fixes¶
Test System Improvements¶
- Fixed test interaction issues in security limit testing (2 remaining, non-critical)
- Enhanced pandas integration with proper NaN/NaT handling in new configuration system
- Improved mock object handling in type detection system
- Better error handling for edge cases in complex type serialization
Serialization Behavior Improvements¶
- Consistent Series handling: Default to dict format (index โ value mapping) for better structure
- Enhanced NaN handling: Improved consistency with configurable NULL conversion
- Better complex object serialization: Structured output for debugging and inspection
- Improved type detection: More reliable handling of custom objects
โก Performance Optimizations¶
Configuration-Driven Optimization¶
- Smart default selection based on data characteristics
- Configurable performance vs fidelity trade-offs for different use cases
- Custom serializer caching for known object types
- Memory-efficient serialization with size-aware optimizations
Advanced Type Processing¶
- Optimized type detection with early exit strategies
- Efficient complex object handling with minimal overhead
- Batch processing optimizations for collections and arrays
- Reduced memory allocation in high-throughput scenarios
๐ฎ Foundation for Future Features¶
This release establishes a solid foundation for upcoming enhancements:
- v0.3.0: Typed deserialization with cast_to_template()
function
- v0.4.0: Redaction and privacy controls for sensitive data
- v0.5.0: Snapshot testing utilities for ML model validation
- v0.6.0: Delta-aware serialization for version control integration
๐ Migration Notes¶
For Existing Users: - No action required - all existing code continues to work unchanged - Optional optimization - add configuration parameter for performance gains - New capabilities - advanced types now serialize automatically
For New Users:
- Start with presets - use get_ml_config()
, get_api_config()
, etc.
- Profile your use case - run benchmark suite with your actual data
- Leverage advanced types - decimals, UUIDs, complex numbers now supported
๐ Achievements¶
- โ Zero breaking changes while adding major functionality
- โ 7x performance improvement with configuration optimization
- โ 12+ new data types supported with advanced type handling
- โ 83% test coverage maintained with 39 new comprehensive tests
- โ Enterprise-ready configuration system with 4 preset optimizations
- โ Comprehensive documentation with real performance measurements
- โ Production-proven with extensive benchmarking and validation
[0.1.4] - 2025-06-01¶
๐ Security & Testing Improvements¶
- Enhanced security test robustness
- Fixed flaky CI failures in security limit tests for dict and list size validation
- Added comprehensive diagnostics for debugging CI-specific import issues
- Improved exception handling with fallback SecurityError detection
- Enhanced depth limit testing with multi-approach validation (recursion limit + monkey-patching)
๐งช Test Infrastructure¶
- Dynamic environment-aware test configuration
- Smart CI vs local test parameter selection based on recursion limits
- Conservative CI limits (depth=250, size=50k) vs thorough local testing
- Added extensive diagnostics for import identity verification
- Robust fake object testing without memory exhaustion
๐ Bug Fixes¶
- Resolved CI-specific test failures
- Fixed SecurityError import inconsistencies in parallel test execution
- Eliminated flaky test behavior in GitHub Actions environment
- Improved exception type checking with isinstance() fallbacks
- Enhanced test reliability across different Python versions (3.11-3.12)
๐จโ๐ป Developer Experience¶
- Comprehensive test diagnostics
- Added detailed environment detection and reporting
- Enhanced error messages with module info and exception type analysis
- Improved debugging capabilities for CI environment differences
- Better test isolation and state management
[0.1.3] - 2025-05-31¶
๐ Major Changes¶
- BREAKING: Renamed package from
serialpy
todatason
- Updated all imports:
from serialpy
โfrom datason
- Updated package name in PyPI and documentation
- Maintained full API compatibility
๐ง DevOps Infrastructure Fixes¶
- Fixed GitHub Pages deployment permission errors
- Replaced deprecated
peaceiris/actions-gh-pages@v3
with modern GitHub Pages workflow - Updated to use official actions:
actions/configure-pages@v4
,actions/upload-pages-artifact@v3
,actions/deploy-pages@v4
- Added proper workflow-level permissions:
contents: read
,pages: write
,id-token: write
-
Added concurrency control and
workflow_dispatch
trigger -
Resolved Dependabot configuration issues
- Fixed missing labels error by adding 43+ comprehensive labels
- Fixed overlapping directories error by consolidating pip configurations
- Replaced deprecated reviewers field with
.github/CODEOWNERS
file -
Changed target branch from
develop
tomain
-
Fixed auto-merge workflow circular dependency
- Resolved infinite loop where auto-merge waited for itself
- Updated to
pull_request_target
event with proper permissions - Upgraded to
hmarr/auto-approve-action@v4
- Added explicit
pull-requests: write
permissions - Fixed "fatal: not a git repository" errors with proper checkout steps
- Updated to
fastify/github-action-merge-dependabot@v3
with correct configuration
๐ Improved Test Coverage & CI¶
- Enhanced Codecov integration
- Fixed coverage reporting from 45% to 86% by running all test files
- Added Codecov test results action with JUnit XML
- Added proper CODECOV_TOKEN usage and branch coverage tracking
-
Upload HTML coverage reports as artifacts
-
Fixed PyPI publishing workflow
- Updated package name references from 'pyjsonify' to 'datason'
- Fixed trusted publishing configuration
- Added comprehensive build verification and package checks
๐ Documentation Updates¶
- Updated contributing guidelines
- Replaced black/flake8 references with ruff
- Updated development workflow from 8 to 7 steps:
ruff check --fix . && ruff format .
-
Updated code style description to "Linter & Formatter: Ruff (unified tool)"
-
Enhanced repository configuration
- Added comprehensive repository description and topics
- Updated website links to reflect new package name
- Added comprehensive setup guide at
docs/GITHUB_PAGES_SETUP.md
๐ Security & Branch Protection¶
- Implemented smart branch protection
- Added GitHub repository rules requiring status checks and human approval
- Enabled auto-merge for repository with proper protection rules
- Configured branch protection to allow Dependabot bypass while maintaining security
๐ท๏ธ Release Management¶
- Professional release workflow
- Updated version management from "0.1.0" to "0.1.3"
- Implemented proper Git tagging and release automation
- Added comprehensive changelog documentation
- Fixed release date accuracy (2025-05-31)
๐ค Automation Features¶
- Comprehensive workflow management
- Smart Dependabot updates: conservative for ML libraries, aggressive for dev tools
- Complete labeling system with 43+ labels for categorization
- Automated reviewer assignment via CODEOWNERS
- Working auto-approval and auto-merge for safe dependency updates
๐ Bug Fixes¶
- Fixed auto-merge semver warning with proper
target: patch
parameter - Resolved GitHub Actions permission errors across all workflows
- Fixed environment validation errors in publishing workflows
- Corrected package import statements and references throughout codebase
โก Performance Improvements¶
- Optimized CI workflow to run all tests for complete coverage
- Enhanced build process with proper artifact management
- Improved development setup with updated tooling configuration
๐จโ๐ป Developer Experience¶
- Enhanced debugging with comprehensive logging and error reporting
- Improved development workflow with unified ruff configuration
- Better IDE support with updated type checking and linting
- Streamlined release process with automated tagging and publishing
[0.1.1] - 2025-05-30¶
Added¶
- Initial package structure and core serialization functionality
- Basic CI/CD pipeline setup
- Initial documentation framework