# 📊 Chunked & Streaming Processing
Functions for processing large datasets efficiently within bounded memory.
## 🎯 Overview
Chunked and streaming functions allow processing of large datasets that don't fit in memory.
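A minimal end-to-end sketch of the two approaches covered on this page (the file names and example data are illustrative, not part of the API):

```python
from datason import serialize_chunked, StreamingSerializer

# Chunked: break one large object into memory-bounded pieces, then save them.
large_list = list(range(1_000_000))
result = serialize_chunked(large_list, chunk_size=10_000)
result.save_to_file("large_list.jsonl", format="jsonl")

# Streaming: write records to a file as they are produced.
with StreamingSerializer("events.jsonl", format="jsonl") as stream:
    for i in range(1_000_000):
        stream.write({"event_id": i})
```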
## 📦 Functions
### serialize_chunked()
`datason.serialize_chunked(obj: Any, chunk_size: int = 1000, config: Optional[SerializationConfig] = None, memory_limit_mb: Optional[int] = None) -> ChunkedSerializationResult`
Serialize large objects in memory-bounded chunks.
This function breaks large objects (lists, DataFrames, arrays) into smaller chunks to enable processing of datasets larger than available memory.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `obj` | `Any` | Object to serialize (typically list, DataFrame, or array) | *required* |
| `chunk_size` | `int` | Number of items per chunk | `1000` |
| `config` | `Optional[SerializationConfig]` | Serialization configuration | `None` |
| `memory_limit_mb` | `Optional[int]` | Optional memory limit in MB (not enforced yet, reserved for future use) | `None` |
Returns:

| Type | Description |
| --- | --- |
| `ChunkedSerializationResult` | `ChunkedSerializationResult` with an iterator of serialized chunks |
Examples:
>>> large_list = list(range(10000))
>>> result = serialize_chunked(large_list, chunk_size=100)
>>> chunks = result.to_list() # Get all chunks
>>> len(chunks) # 100 chunks of 100 items each
100
>>> # Save directly to file without loading all chunks
>>> result = serialize_chunked(large_data, chunk_size=1000)
>>> result.save_to_file("large_data.jsonl", format="jsonl")
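Because the result wraps an iterator, chunks can also be consumed one at a time instead of being materialized with `to_list()`. A minimal sketch, assuming the iterator is exposed as the `chunks` attribute described in the constructor below; the per-chunk handling is illustrative:

```python
from datason import serialize_chunked

large_list = list(range(10_000))
result = serialize_chunked(large_list, chunk_size=100)

total_items = 0
for chunk in result.chunks:    # yields one serialized chunk at a time
    total_items += len(chunk)  # hypothetical per-chunk processing step
print(total_items)             # 10000
```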
### ChunkedSerializationResult
`datason.ChunkedSerializationResult(chunks: Iterator[Any], metadata: Dict[str, Any])`
Result container for chunked serialization operations.
Initialize chunked result.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `chunks` | `Iterator[Any]` | Iterator of serialized chunks | *required* |
| `metadata` | `Dict[str, Any]` | Metadata about the chunking operation | *required* |
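In practice the result is obtained from `serialize_chunked()` rather than constructed directly. A minimal sketch of inspecting it, assuming `metadata` is exposed as an attribute as the constructor parameters above suggest (the exact metadata keys depend on the chunking operation):

```python
from datason import serialize_chunked

result = serialize_chunked(list(range(500)), chunk_size=100)
print(result.metadata)      # details about the chunking operation
chunks = result.to_list()   # materialize every chunk at once
print(len(chunks))          # 5 chunks of 100 items each
```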
`save_to_file(file_path: Union[str, Path], format: str = 'jsonl') -> None`
Save chunks to a file.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file_path` | `Union[str, Path]` | Path to save the chunks | *required* |
| `format` | `str` | Format to save (`'jsonl'` for JSON Lines, `'json'` for a single array) | `'jsonl'` |
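A short sketch of the two output formats (file names and data are illustrative):

```python
from datason import serialize_chunked

records = [{"id": i} for i in range(5_000)]

# JSON Lines: one serialized chunk per line, convenient for streaming readers.
serialize_chunked(records, chunk_size=1_000).save_to_file("records.jsonl", format="jsonl")

# A single JSON array instead (re-chunk first, since the chunk iterator is consumed once).
serialize_chunked(records, chunk_size=1_000).save_to_file("records.json", format="json")
```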
### StreamingSerializer
`datason.StreamingSerializer(file_path: Union[str, Path], config: Optional[SerializationConfig] = None, format: str = 'jsonl', buffer_size: int = 8192)`
Context manager for streaming serialization to files.
Enables processing of datasets larger than available memory by writing serialized data directly to files without keeping everything in memory.
Initialize streaming serializer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file_path` | `Union[str, Path]` | Path to output file | *required* |
| `config` | `Optional[SerializationConfig]` | Serialization configuration | `None` |
| `format` | `str` | Output format (`'jsonl'` or `'json'`) | `'jsonl'` |
| `buffer_size` | `int` | Write buffer size in bytes | `8192` |
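A minimal sketch of typical usage, writing records one at a time inside the context manager (the record contents and file name are illustrative):

```python
from datason import StreamingSerializer

with StreamingSerializer("events.jsonl", format="jsonl") as stream:
    for i in range(100_000):
        # Each object is serialized and written as it arrives, so memory
        # usage stays flat no matter how many records are produced.
        stream.write({"event_id": i, "ok": True})
```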
`__enter__() -> StreamingSerializer`
Enter context manager.
`__exit__(exc_type: Any, exc_val: Any, exc_tb: Any) -> None`
Exit context manager.
`write(obj: Any) -> None`
Write a single object to the stream.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `obj` | `Any` | Object to serialize and write | *required* |
`write_chunked(obj: Any, chunk_size: int = 1000) -> None`
Write a large object using chunked serialization.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `obj` | `Any` | Large object to chunk and write | *required* |
| `chunk_size` | `int` | Size of each chunk | `1000` |
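A sketch of chunking a large DataFrame straight into an open stream (the chunk size and file name are illustrative):

```python
import pandas as pd
from datason import StreamingSerializer

df = pd.DataFrame({"a": range(1_000_000), "b": range(1_000_000)})

with StreamingSerializer("frames.jsonl", format="jsonl") as stream:
    # The DataFrame is split into 50,000-item chunks, each serialized
    # and written before the next chunk is produced.
    stream.write_chunked(df, chunk_size=50_000)
```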
### estimate_memory_usage()
`datason.estimate_memory_usage(obj: Any, config: Optional[SerializationConfig] = None) -> Dict[str, Any]`
Estimate memory usage for serializing an object.
This is a rough estimate intended to help users choose a chunking strategy.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `obj` | `Any` | Object to analyze | *required* |
| `config` | `Optional[SerializationConfig]` | Serialization configuration | `None` |
Returns:

| Type | Description |
| --- | --- |
| `Dict[str, Any]` | Dictionary with memory usage estimates |
Examples:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': range(10000), 'b': range(10000)})
>>> stats = estimate_memory_usage(df)
>>> print(f"Estimated serialized size: {stats['estimated_serialized_mb']:.1f} MB")
>>> print(f"Recommended chunk size: {stats['recommended_chunk_size']}")
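The estimate can feed directly into a chunking call. A minimal sketch, assuming the `'recommended_chunk_size'` key shown in the example above (the DataFrame and file name are illustrative):

```python
import pandas as pd
from datason import estimate_memory_usage, serialize_chunked

df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})
stats = estimate_memory_usage(df)

# Size the chunks from the library's own recommendation.
result = serialize_chunked(df, chunk_size=stats["recommended_chunk_size"])
result.save_to_file("df_chunks.jsonl", format="jsonl")
```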
## 🔗 Related Documentation
- Core Functions - Basic serialization functions
- Modern Serialization - Modern chunked functions