# Idempotency Implementation Plan

## Overview

This document outlines the strategy for adding idempotency to the datason core serialization and deserialization system. The goal is to prevent double serialization/deserialization while preserving all existing performance optimizations and battle-tested functionality.
## Problem Statement

Currently, calling `serialize()` on already-serialized data can cause:

1. **Double serialization**: `{"__datason_type__": "dict", "__datason_value__": {...}}` becomes nested inside another metadata layer
2. **Performance degradation**: unnecessary reprocessing of already-processed data
3. **Data corruption**: loss of the original structure through repeated transformations
4. **Memory bloat**: exponential growth of metadata structures
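A hedged illustration of the failure mode (assuming `datason.serialize()` is configured to emit type metadata; the output shape is schematic):

```python
import datason
from datetime import datetime

data = {"when": datetime(2024, 1, 1)}
once = datason.serialize(data)
# Schematically: {"when": {"__datason_type__": "datetime", "__datason_value__": "..."}}

twice = datason.serialize(once)
# Without an idempotency check, the metadata dict itself is treated as raw
# input and wrapped again, producing nested __datason_type__ layers.
```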
## Core Principles

- **Preserve existing architecture**: all 8 layers of the current system must remain intact
- **Minimal performance impact**: idempotency checks should be ultra-fast
- **Battle-tested compatibility**: all existing tests must pass
- **Safe fallback**: if detection fails, continue with normal processing
- **Comprehensive coverage**: handle both serialization and deserialization
## Implementation Strategy

### Phase 1: Core Serialization (core.py → core_new.py)

#### Step 1: Reset and Copy

```bash
# Reset core.py to the clean state from the beginning of the commit
git checkout HEAD~1 -- datason/core.py

# Copy to the new implementation
cp datason/core.py datason/core_new.py
```
#### Step 2: Layer-by-Layer Idempotency Integration

Based on the core serialization strategy, add idempotency checks at strategic points:

**Layer 1 (Ultra-Fast Path) - Lines 248-265**

- No changes needed: basic types (`int`, `bool`, `None`, short strings) are inherently idempotent
- These types cannot contain serialization metadata

**Layer 2 (Security Layer) - Lines 267-288**

- Add the primary idempotency check here
- Check for `__datason_type__` and `__datason_value__` keys
- Check for circular reference markers
- This catches most already-serialized data early
```python
# Add after line 267 (Any and Optional are already imported in core.py)
def _check_already_serialized(obj: Any) -> Optional[Any]:
    """Check if the object is already serialized and return it if so."""
    if isinstance(obj, dict):
        # Check for type metadata
        if "__datason_type__" in obj and "__datason_value__" in obj:
            return obj
        # Check for circular reference markers
        if obj.get("__datason_type__") == "circular_reference":
            return obj
        # Check for redaction summaries
        if "redaction_summary" in obj and isinstance(obj.get("redaction_summary"), dict):
            return obj
    elif isinstance(obj, (list, tuple)):
        # Check for serialized list metadata
        if len(obj) == 2 and isinstance(obj[0], str) and obj[0].startswith("__datason_"):
            return obj
    return None
```
**Layer 4 (JSON-First Optimization) - Lines 306-340**

- Add a lightweight check before JSON compatibility detection
- Quick scan for serialization markers in the top-level structure

**Layer 5 (Iterative Processing) - Lines 341-370**

- Add item-level checks before processing each collection item
- Prevent reprocessing of already-serialized nested structures

**Layer 7 (Hot Path Processing) - Lines 480-516**

- Add container checks before processing small containers
- Quick detection of serialized nested objects

**Layer 8 (Full Processing Path) - Lines 520+**

- Add comprehensive checks before complex type handling
- Final safety net for any missed cases
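As a minimal sketch of how each layer would invoke the helper (placement and variable names are illustrative, not existing datason code):

```python
# Inside serialize(), at the top of a layer's processing (illustrative)
already = _check_already_serialized(obj)
if already is not None:
    return already  # already-serialized data passes through unchanged
```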
#### Step 3: Performance Optimization

**Caching Strategy:**

```python
# Add to the existing caches (Dict and Optional come from typing, as elsewhere in core.py)
_SERIALIZATION_STATE_CACHE: Dict[int, str] = {}  # Maps object id to state
_CACHE_SIZE_LIMIT = 1000

def _get_cached_serialization_state(obj: Any) -> Optional[str]:
    """Get the cached serialization state: 'serialized', 'raw', or None (unknown)."""
    # NOTE: id() values can be reused after garbage collection, so entries
    # may go stale; treat the cache as a hint, not a guarantee.
    obj_id = id(obj)
    if obj_id in _SERIALIZATION_STATE_CACHE:
        return _SERIALIZATION_STATE_CACHE[obj_id]
    # Determine the state and cache it if space is available
    if len(_SERIALIZATION_STATE_CACHE) < _CACHE_SIZE_LIMIT:
        state = _detect_serialization_state(obj)
        _SERIALIZATION_STATE_CACHE[obj_id] = state
        return state
    # Cache full: signal "unknown" so callers fall back to full detection
    return None
```
**Fast Detection Patterns:**

```python
def _detect_serialization_state(obj: Any) -> str:
    """Ultra-fast detection of serialization state."""
    if isinstance(obj, dict):
        # Check for metadata keys (the most common pattern)
        if "__datason_type__" in obj:
            return "serialized"
        # Check for other serialization markers
        if any(key.startswith("__datason_") for key in obj.keys()):
            return "serialized"
    return "raw"
```
### Phase 2: Core Deserialization (deserializers.py → deserializers_new.py)

#### Step 1: Copy and Analyze
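The plan leaves this step's commands implicit; mirroring Phase 1, the copy would presumably be:

```bash
# Copy to the new implementation
cp datason/deserializers.py datason/deserializers_new.py
```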
#### Step 2: Add Idempotency to Main Functions
**`deserialize()` function - Line 95:**

```python
def deserialize(obj: Any, parse_dates: bool = True, parse_uuids: bool = True) -> Any:
    # Add idempotency check at the beginning
    if _is_already_deserialized(obj):
        return obj
    # Continue with existing logic...
```
**`auto_deserialize()` function - Line 140:**

```python
def auto_deserialize(obj: Any, aggressive: bool = False) -> Any:
    # Add idempotency check
    if _is_already_deserialized(obj):
        return obj
    # Continue with existing logic...
```
**`deserialize_fast()` function - Line 1950:**

```python
def deserialize_fast(obj: Any, config: Optional["SerializationConfig"] = None,
                     _depth: int = 0, _seen: Optional[Set[int]] = None) -> Any:
    # Add idempotency check in Phase 0
    if _is_already_deserialized(obj):
        return obj
    # Continue with existing ultra-fast path...
```
#### Step 3: Deserialization State Detection

```python
# datetime, uuid, Path, Decimal, np, and pd are module-level imports in deserializers.py
def _is_already_deserialized(obj: Any) -> bool:
    """Check if the object appears to be already deserialized."""
    # Raw Python objects that don't need deserialization
    if isinstance(obj, (datetime, uuid.UUID, Path, Decimal)):
        return True
    # NumPy/Pandas objects
    if np is not None and isinstance(obj, (np.ndarray, np.generic)):
        return True
    if pd is not None and isinstance(obj, (pd.DataFrame, pd.Series)):
        return True
    # Complex numbers, sets, tuples (non-JSON types)
    if isinstance(obj, (complex, set, tuple)):
        return True
    # Check containers for already-deserialized contents
    if isinstance(obj, (list, dict)):
        return _contains_deserialized_objects(obj)
    return False

def _contains_deserialized_objects(obj: Any, max_depth: int = 3) -> bool:
    """Check if a container holds already-deserialized objects (samples the first 5 items)."""
    if max_depth <= 0:
        return False
    if isinstance(obj, list):
        items = obj[:5]
    elif isinstance(obj, dict):
        items = list(obj.values())[:5]
    else:
        return False
    for item in items:
        if isinstance(item, (list, dict)):
            # Recurse with reduced depth so the scan stays bounded
            if _contains_deserialized_objects(item, max_depth - 1):
                return True
        elif _is_already_deserialized(item):
            return True
    return False
```
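To make the intended behavior concrete, a small hedged illustration (assuming the check is wired into `deserialize()` as shown in Step 2):

```python
from datetime import datetime

obj = {"ts": datetime(2024, 1, 1), "tags": {"a", "b"}}  # already-deserialized values
assert _is_already_deserialized(obj)   # detected via the sampled container scan
assert deserialize(obj) is obj         # the new check returns the input unchanged
```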
### Phase 3: Test Strategy

#### Step 1: Copy All Existing Tests

Create parallel test files that use the new implementations:

```bash
# Core serialization tests
cp tests/unit/test_core_comprehensive.py tests/unit/test_core_new_comprehensive.py
cp tests/edge_cases/test_core_edge_cases.py tests/edge_cases/test_core_edge_cases_new.py

# Deserialization tests
cp tests/unit/test_deserializers_comprehensive.py tests/unit/test_deserializers_new_comprehensive.py
cp tests/edge_cases/test_deserializers_edge_cases.py tests/edge_cases/test_deserializers_edge_cases_new.py

# Integration tests
cp tests/integration/test_round_trip.py tests/integration/test_round_trip_new.py
```
#### Step 2: Update Imports in New Test Files

Replace the imports in all new test files:

```python
# Old
import datason.core as core
from datason.deserializers import deserialize

# New
import datason.core_new as core
from datason.deserializers_new import deserialize
```
#### Step 3: Add Idempotency-Specific Tests

Create a new test file, `tests/unit/test_idempotency.py`:

```python
class TestSerializationIdempotency:
    def test_double_serialization_prevention(self):
        """Test that serializing already-serialized data returns it unchanged."""

    def test_nested_serialized_data(self):
        """Test handling of nested already-serialized structures."""

    def test_mixed_serialized_raw_data(self):
        """Test containers with a mix of serialized and raw data."""

class TestDeserializationIdempotency:
    def test_double_deserialization_prevention(self):
        """Test that deserializing already-deserialized data returns it unchanged."""

    def test_nested_deserialized_data(self):
        """Test handling of nested already-deserialized structures."""
```
#### Step 4: Performance Regression Tests

Create `tests/performance/test_idempotency_performance.py`:

```python
class TestIdempotencyPerformance:
    def test_serialization_performance_impact(self):
        """Ensure idempotency checks don't significantly impact serialization performance."""

    def test_deserialization_performance_impact(self):
        """Ensure idempotency checks don't significantly impact deserialization performance."""

    def test_cache_effectiveness(self):
        """Test that caching speeds up repeated operations."""
```
### Phase 4: Validation and Testing

#### Step 1: Run All New Tests

```bash
# Test the new core implementation
pytest tests/unit/test_core_new_comprehensive.py -v
pytest tests/edge_cases/test_core_edge_cases_new.py -v

# Test the new deserializers implementation
pytest tests/unit/test_deserializers_new_comprehensive.py -v
pytest tests/edge_cases/test_deserializers_edge_cases_new.py -v

# Test integration
pytest tests/integration/test_round_trip_new.py -v

# Test idempotency specifically
pytest tests/unit/test_idempotency.py -v

# Performance validation
pytest tests/performance/test_idempotency_performance.py -v
```
#### Step 2: Benchmark Comparison

```bash
# Compare performance before/after
python benchmarks/compare_implementations.py --old=core --new=core_new
python benchmarks/compare_implementations.py --old=deserializers --new=deserializers_new
```

#### Step 3: Coverage Analysis

```bash
# Ensure coverage doesn't decrease
pytest --cov=datason.core_new --cov=datason.deserializers_new --cov-report=html
```
### Phase 5: Integration and Deployment

#### Step 1: Validate All Tests Pass

- All existing functionality preserved
- All new idempotency tests pass
- Performance within acceptable bounds (< 5% regression)
- Coverage maintained or improved
#### Step 2: Replace Original Files

```bash
# Only after all tests pass
mv datason/core.py datason/core_backup.py
mv datason/core_new.py datason/core.py
mv datason/deserializers.py datason/deserializers_backup.py
mv datason/deserializers_new.py datason/deserializers.py
```
#### Step 3: Update All Test Files

```bash
# Update imports back to the original names
sed -i 's/core_new/core/g' tests/unit/test_core_new_comprehensive.py
sed -i 's/deserializers_new/deserializers/g' tests/unit/test_deserializers_new_comprehensive.py
# ... etc for all test files
```

#### Step 4: Final Validation
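The plan leaves this step's commands implicit; a full-suite run with coverage would presumably suffice:

```bash
pytest tests/ -v --cov=datason
```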
## Risk Mitigation

### Potential Issues and Solutions

- **Performance Regression**
    - Risk: idempotency checks slow down hot paths
    - Mitigation: aggressive caching, ultra-fast detection patterns
    - Fallback: feature flag to disable idempotency checks (sketched below)
- **False Positives**
    - Risk: incorrectly identifying raw data as serialized
    - Mitigation: conservative detection patterns, comprehensive testing
    - Fallback: continue with normal processing if detection is uncertain
- **Memory Leaks**
    - Risk: caches grow unbounded
    - Mitigation: cache size limits, periodic cleanup
    - Fallback: disable caching if memory pressure is detected
- **Edge Case Breakage**
    - Risk: unusual data patterns break the idempotency logic
    - Mitigation: extensive edge-case testing, safe fallbacks
    - Fallback: bypass idempotency for problematic patterns
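A minimal sketch of such a feature flag (hypothetical; neither the environment variable nor the constant exists in datason today):

```python
import os

# Hypothetical kill switch for the new checks
_IDEMPOTENCY_ENABLED = os.environ.get("DATASON_IDEMPOTENCY_CHECKS", "1") != "0"

def _check_already_serialized(obj):
    if not _IDEMPOTENCY_ENABLED:
        return None  # behave exactly like the pre-idempotency code path
    ...  # detection logic from Phase 1, Step 2
```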
### Rollback Plan

If issues are discovered after deployment:

- **Immediate**: revert to the backup files (see the sketch below)
- **Short-term**: disable idempotency via the feature flag
- **Long-term**: fix the issues and redeploy with additional testing
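The immediate revert undoes Phase 5, Step 2; the `*_failed.py` names here are illustrative:

```bash
mv datason/core.py datason/core_failed.py
mv datason/core_backup.py datason/core.py
mv datason/deserializers.py datason/deserializers_failed.py
mv datason/deserializers_backup.py datason/deserializers.py
```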
## Success Criteria

### Functional Requirements

- All existing tests pass with the new implementation
- Double serialization is prevented in all test cases
- Double deserialization is prevented in all test cases
- Round-trip serialization/deserialization works correctly
- All edge cases are handled properly

### Performance Requirements

- < 5% performance regression on common data types
- < 10% performance regression on complex data types
- Idempotency checks complete in < 100 ns for cached cases
- Memory usage increase < 10% for typical workloads

### Quality Requirements

- Code coverage maintained or improved
- All linting checks pass
- Documentation updated
- No security vulnerabilities introduced
## Timeline

### Week 1: Core Serialization

- Days 1-2: reset, copy, and implement Layer 2 idempotency
- Days 3-4: implement the remaining layer checks
- Day 5: performance optimization and caching

### Week 2: Deserialization

- Days 1-2: copy and implement deserialization idempotency
- Days 3-4: add detection patterns and optimization
- Day 5: integration testing

### Week 3: Testing and Validation

- Days 1-2: copy all tests and update imports
- Days 3-4: create idempotency-specific tests
- Day 5: performance benchmarking

### Week 4: Integration and Deployment

- Days 1-2: final validation and bug fixes
- Day 3: replace the original files
- Days 4-5: final testing and documentation
## Conclusion

This plan provides a systematic approach to adding idempotency while preserving the battle-tested performance optimizations and comprehensive functionality of the existing system. The layered approach ensures minimal risk and maximum compatibility.

The keys to success are:

1. **Incremental implementation**: one layer at a time
2. **Comprehensive testing**: parallel test suites ensure nothing breaks
3. **Performance focus**: aggressive optimization of the idempotency checks
4. **Safe fallbacks**: always continue processing if detection fails
5. **Careful validation**: multiple checkpoints before deployment