Pandas Integration
Deep integration with the pandas ecosystem for data science workflows, including DataFrames, Series, Index types, and specialized pandas objects.
Overview
datason provides comprehensive support for pandas objects:
- DataFrames: Configurable orientation (records, split, index, columns, values, table)
- Series: Index preservation and metadata handling
- Index Types: RangeIndex, DatetimeIndex, MultiIndex, CategoricalIndex
- Categorical: Category metadata and ordering preservation
- NaN Handling: Configurable strategies for missing data
- Performance: Optimized serialization for large datasets
DataFrame Serialization
Orientation Options
DataFrames can be serialized in different orientations based on your use case:
import pandas as pd
import datason
from datason import SerializationConfig, DataFrameOrient
# Sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z'],
    'C': [10.1, 20.2, 30.3]
})
# Records orientation (default) - intuitive for APIs
config = SerializationConfig(dataframe_orient=DataFrameOrient.RECORDS)
result = datason.serialize(df, config=config)
# Output: [{"A": 1, "B": "x", "C": 10.1}, {"A": 2, "B": "y", "C": 20.2}, ...]
# Split orientation - efficient for large datasets
config = SerializationConfig(dataframe_orient=DataFrameOrient.SPLIT)
result = datason.serialize(df, config=config)
# Output: {"columns": ["A", "B", "C"], "data": [[1, "x", 10.1], [2, "y", 20.2], ...]}
# Values orientation - array-like data only
config = SerializationConfig(dataframe_orient=DataFrameOrient.VALUES)
result = datason.serialize(df, config=config)
# Output: [[1, "x", 10.1], [2, "y", 20.2], [3, "z", 30.3]]
# Index orientation - preserves row labels
config = SerializationConfig(dataframe_orient=DataFrameOrient.INDEX)
result = datason.serialize(df, config=config)
# Output: {"0": {"A": 1, "B": "x", "C": 10.1}, "1": {"A": 2, "B": "y", "C": 20.2}, ...}
# Columns orientation - column-wise data
config = SerializationConfig(dataframe_orient=DataFrameOrient.COLUMNS)
result = datason.serialize(df, config=config)
# Output: {"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [10.1, 20.2, 30.3]}
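The `table` orientation listed in the overview is not shown above. As a rough guide to its shape, the sketch below uses pandas' own `to_json(orient="table")` on the same sample DataFrame. Whether datason's table output matches this layout exactly is an assumption; the block only illustrates the pandas Table Schema format.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["x", "y", "z"],
    "C": [10.1, 20.2, 30.3],
})

# pandas' "table" orientation bundles a schema description with the row data.
# Assumption: datason's table output resembles this; verify before relying on it.
table = json.loads(df.to_json(orient="table"))

field_names = [field["name"] for field in table["schema"]["fields"]]
first_row = table["data"][0]

print(field_names)  # ['index', 'A', 'B', 'C']
print(first_row)    # {'index': 0, 'A': 1, 'B': 'x', 'C': 10.1}
```

The schema makes the payload self-describing, at the cost of extra bytes compared with the `split` or `values` orientations.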
Performance Comparison
Indicative timings from benchmarking (absolute numbers vary with hardware, pandas version, and data shape):
Orientation | Small DataFrames (<1K rows) | Large DataFrames (>1K rows) | Best For |
---|---|---|---|
Values | 0.07ms (fastest) | 2.42ms | Array-like processing |
Split | 0.21ms | 1.63ms (fastest) | Large datasets, efficiency |
Records | 0.24ms | 2.48ms | APIs, intuitive structure |
Columns | 0.25ms | 2.45ms | Column-wise operations |
Index | 0.30ms | 2.80ms | Row-labeled data |
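Timings like those in the table depend heavily on the machine and the data, so it can be worth measuring on your own DataFrames. The harness below is a minimal sketch: it times pandas' `to_json` for each orientation as a stand-in workload, and `time_orientations` is a hypothetical helper name, not part of datason. To measure datason itself, replace the timed call with `datason.serialize(df, config=...)`.

```python
import timeit

import numpy as np
import pandas as pd

ORIENTS = ("values", "split", "records", "columns", "index")


def time_orientations(df, repeat=5):
    """Return the best-of-`repeat` wall time in milliseconds per orientation."""
    timings = {}
    for orient in ORIENTS:
        # Stand-in workload (pandas only); swap in datason.serialize to benchmark datason.
        runs = timeit.repeat(lambda: df.to_json(orient=orient), number=1, repeat=repeat)
        timings[orient] = min(runs) * 1000.0
    return timings


rng = np.random.default_rng(0)
small = pd.DataFrame(rng.standard_normal((100, 5)), columns=list("ABCDE"))
large = pd.DataFrame(rng.standard_normal((5000, 5)), columns=list("ABCDE"))

for label, frame in (("small", small), ("large", large)):
    timings = time_orientations(frame)
    fastest = min(timings, key=timings.get)
    print(label, "fastest:", fastest, {k: round(v, 3) for k, v in timings.items()})
```

Taking the minimum over several runs reduces noise from other processes; no expected numbers are shown because they differ from machine to machine.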
Series Handling
Basic Series Serialization
import pandas as pd
import datason
# Simple series
series = pd.Series([1, 2, 3, 4, 5], name='values')
result = datason.serialize(series)
# Output: {"type": "pandas.Series", "data": [1, 2, 3, 4, 5], "name": "values", "index": [0, 1, 2, 3, 4]}
# Series with custom index
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='measurements')
result = datason.serialize(series)
# Output: {"type": "pandas.Series", "data": [10, 20, 30], "name": "measurements", "index": ["a", "b", "c"]}
Series with DatetimeIndex
# Time series data
dates = pd.date_range('2024-01-01', periods=3, freq='D')
ts = pd.Series([100, 105, 110], index=dates, name='stock_price')
result = datason.serialize(ts)
# Preserves datetime index information
Index Types
RangeIndex
# Default integer index
df = pd.DataFrame({'values': [1, 2, 3]})
result = datason.serialize(df)
# RangeIndex is efficiently represented
DatetimeIndex
# Time-based index
dates = pd.date_range('2024-01-01', periods=5, freq='D')
df = pd.DataFrame({'values': range(5)}, index=dates)
result = datason.serialize(df)
# Datetime index preserved (including timezone information when the index is tz-aware)
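The index in the example above is timezone-naive, so there is no timezone to preserve. The sketch below shows, with pandas alone, how a tz-aware index survives a trip through ISO 8601 strings. It illustrates the general idea only; datason's actual wire format for datetime indexes is not shown here and may differ.

```python
import pandas as pd

# A timezone-aware daily index (UTC)
idx = pd.date_range("2024-01-01", periods=5, freq="D", tz="UTC")

# ISO 8601 strings carry the UTC offset, so they are safe to embed in JSON
iso_strings = [ts.isoformat() for ts in idx]

# Parsing the strings restores a tz-aware DatetimeIndex
rebuilt = pd.to_datetime(iso_strings)

print(iso_strings[0])           # 2024-01-01T00:00:00+00:00
print(rebuilt.tz is not None)   # True
print((rebuilt == idx).all())   # True
```

Note that the `freq` attribute of the original index is not carried by the strings; only the timestamps and their offset are.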
MultiIndex
# Hierarchical index
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'values': [10, 20, 30, 40]}, index=index)
result = datason.serialize(df)
# MultiIndex structure and names preserved
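To make "structure and names preserved" concrete, here is a pandas-only sketch of one way a MultiIndex can be flattened to JSON and rebuilt. The `names`/`arrays` payload layout is an illustrative assumption, not datason's format.

```python
import json

import pandas as pd

arrays = [["A", "A", "B", "B"], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

# Store one plain list per level plus the level names (hypothetical layout)
payload = json.dumps({
    "names": list(index.names),
    "arrays": [index.get_level_values(i).tolist() for i in range(index.nlevels)],
})

decoded = json.loads(payload)
rebuilt = pd.MultiIndex.from_arrays(decoded["arrays"], names=decoded["names"])

print(rebuilt.equals(index))   # True
print(list(rebuilt.names))     # ['first', 'second']
```

Storing per-level arrays keeps each level's values homogeneous, which avoids mixed-type tuples in the JSON.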
CategoricalIndex
# Categorical index
categories = pd.CategoricalIndex(['small', 'medium', 'large'], ordered=True)
df = pd.DataFrame({'count': [10, 20, 30]}, index=categories)
result = datason.serialize(df)
# Category order and metadata preserved
Categorical Data
Basic Categorical Handling
# Categorical columns
df = pd.DataFrame({
    'grade': pd.Categorical(['A', 'B', 'A', 'C'], ordered=True),
    'size': pd.Categorical(['S', 'M', 'L', 'M'], categories=['S', 'M', 'L', 'XL'])
})
result = datason.serialize(df)
# Categories, ordering, and unused categories preserved
Categorical with Custom Ordering
# Custom category order
priority = pd.Categorical(['high', 'low', 'medium', 'high'],
                          categories=['low', 'medium', 'high'],
                          ordered=True)
df = pd.DataFrame({'priority': priority, 'task_id': range(4)})
result = datason.serialize(df)
# Custom ordering preserved for proper reconstruction
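Reconstructing a categorical correctly needs three pieces of metadata: the integer codes, the category list in order, and the `ordered` flag. The sketch below shows this with pandas alone; the payload keys are illustrative assumptions, not datason's format.

```python
import json

import pandas as pd

priority = pd.Categorical(
    ["high", "low", "medium", "high"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# Codes index into the category list; together with `ordered` they fully describe the data
payload = json.dumps({
    "codes": priority.codes.tolist(),
    "categories": priority.categories.tolist(),
    "ordered": bool(priority.ordered),
})

decoded = json.loads(payload)
rebuilt = pd.Categorical.from_codes(
    decoded["codes"],
    categories=decoded["categories"],
    ordered=decoded["ordered"],
)

print(decoded["codes"])   # [2, 0, 1, 2]
print(list(rebuilt))      # ['high', 'low', 'medium', 'high']
print(rebuilt.ordered)    # True
```

If only the string values were stored, the custom order and any unused categories would be lost on reload.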
Missing Data Handling
NaN Strategies
from datason import SerializationConfig, NanHandling
# DataFrame with missing data
df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [1.1, None, 3.3, 4.4],
    'C': ['x', 'y', None, 'z']
})
# Convert to null (default)
config = SerializationConfig(nan_handling=NanHandling.NULL)
result = datason.serialize(df, config=config)
# None/NaN → null in JSON
# Convert to string representation
config = SerializationConfig(nan_handling=NanHandling.STRING)
result = datason.serialize(df, config=config)
# None/NaN → "NaN" string
# Drop missing values
config = SerializationConfig(nan_handling=NanHandling.DROP)
result = datason.serialize(df, config=config)
# Rows/columns with missing data removed
# Keep original (preserve NaN as special value)
config = SerializationConfig(nan_handling=NanHandling.PRESERVE)
result = datason.serialize(df, config=config)
# Special encoding to preserve exact NaN semantics
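The effect of the null, string, and drop strategies can be previewed with pandas alone. The block below is a sketch of what each strategy means for the sample data, not datason's implementation; `replace_missing` is a hypothetical helper, and the drop case assumes row-wise removal.

```python
import json

import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [1.1, None, 3.3, 4.4],
    "C": ["x", "y", None, "z"],
})


def replace_missing(frame, replacement):
    """Return row records with every missing value swapped for `replacement`."""
    return [
        {key: (replacement if pd.isna(value) else value) for key, value in row.items()}
        for row in frame.to_dict(orient="records")
    ]


as_null = replace_missing(df, None)       # missing values become JSON null
as_string = replace_missing(df, "NaN")    # missing values become the string "NaN"
dropped = df.dropna().to_dict(orient="records")  # assumption: drop whole rows

print(json.dumps(as_null[2]))   # {"A": null, "B": 3.3, "C": null}
print(as_string[2]["A"])        # NaN
print(len(dropped))             # 2
```

Column `A` contains a missing value, so pandas stores it as float; the surviving rows therefore hold `1.0` and `4.0` rather than integers.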
Performance Optimization
Large DataFrame Handling
import numpy as np
# For large DataFrames, use split orientation
large_df = pd.DataFrame(np.random.randn(10000, 50))
config = datason.get_performance_config()
# Automatically uses split orientation for large DataFrames
result = datason.serialize(large_df, config=config)
Memory-Efficient Serialization
# Chunked processing for very large datasets
def serialize_large_dataframe(df, chunk_size=1000):
    """Serialize large DataFrame in chunks."""
    chunks = []
    for i in range(0, len(df), chunk_size):
        # iloc slicing is positional, so this works for any index type
        chunk = df.iloc[i:i+chunk_size]
        chunks.append(datason.serialize(chunk))
    return {"chunks": chunks, "total_rows": len(df)}
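A chunked payload is only useful if it can be reassembled. The sketch below shows the full round trip with pandas alone, using `to_dict(orient="records")` as a stand-in for `datason.serialize`; `to_chunks` and `from_chunks` are hypothetical helper names.

```python
import numpy as np
import pandas as pd


def to_chunks(df, chunk_size=1000):
    """Split a DataFrame into record lists of at most `chunk_size` rows."""
    chunks = [
        df.iloc[start:start + chunk_size].to_dict(orient="records")
        for start in range(0, len(df), chunk_size)
    ]
    return {"chunks": chunks, "total_rows": len(df)}


def from_chunks(payload):
    """Rebuild a DataFrame from a chunked payload and check the row count."""
    frames = [pd.DataFrame(records) for records in payload["chunks"]]
    restored = pd.concat(frames, ignore_index=True)
    if len(restored) != payload["total_rows"]:
        raise ValueError("row count mismatch after reassembly")
    return restored


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(2500, dtype="int64"),
    "value": rng.standard_normal(2500),
})

payload = to_chunks(df)
restored = from_chunks(payload)

print(len(payload["chunks"]))   # 3
print(restored.equals(df))      # True
```

Storing `total_rows` alongside the chunks gives the reader a cheap integrity check. This sketch assumes a default integer index and a non-empty DataFrame; a meaningful index would need to be stored separately.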
Optimized Data Types
# Use efficient data types before serialization
df = df.astype({
    'category_col': 'category',  # Use categorical for repeated strings
    'int_col': 'int32',          # Use smaller int types when possible
    'float_col': 'float32'       # Use float32 instead of float64 when precision allows
})
result = datason.serialize(df)
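The benefit of these conversions is easy to check with `DataFrame.memory_usage`. The sketch below builds synthetic data with the same column names as the snippet above and compares the footprint before and after; the data itself is made up for illustration.

```python
import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "category_col": rng.choice(["red", "green", "blue"], size=n),
    "int_col": np.arange(n, dtype="int64"),
    "float_col": np.linspace(0.0, 1.0, n),
})

# deep=True counts the actual string storage, not just the pointers
before = df.memory_usage(deep=True).sum()

compact = df.astype({
    "category_col": "category",
    "int_col": "int32",
    "float_col": "float32",
})
after = compact.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
print(after < before)   # True
```

Downcasting is lossy when values exceed the smaller type's range or precision, so check the value ranges first.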
Deserialization
Reconstructing DataFrames
from datason import deserialize
# Deserialize back to DataFrame
serialized_df = datason.serialize(df)
reconstructed_df = deserialize(serialized_df)
# Verify reconstruction
assert df.equals(reconstructed_df)
assert df.dtypes.equals(reconstructed_df.dtypes)
assert df.index.equals(reconstructed_df.index)
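When a round trip fails, `DataFrame.equals` only returns `False`. `pandas.testing.assert_frame_equal` reports which column, dtype, or index differs, which is often more useful in tests. The sketch below wraps it in a hypothetical `check_round_trip` helper and demonstrates it with two JSON stand-in functions; in practice you would pass `datason.serialize` and `datason.deserialize` instead.

```python
import json

import pandas as pd
from pandas.testing import assert_frame_equal


def check_round_trip(df, serialize, deserialize):
    """Serialize then deserialize `df`; raise AssertionError with details on mismatch."""
    restored = deserialize(serialize(df))
    assert_frame_equal(df, restored)
    return restored


# Stand-ins for a real serializer pair (assumption: pandas "split" layout via JSON)
def to_split_json(df):
    return json.dumps(df.to_dict(orient="split"))


def from_split_json(text):
    # The "split" dict keys (index, columns, data) match the DataFrame constructor
    return pd.DataFrame(**json.loads(text))


df = pd.DataFrame(
    {"values": [1, 2, 3], "score": [0.5, 1.5, 2.5]},
    index=["row1", "row2", "row3"],
)

restored = check_round_trip(df, to_split_json, from_split_json)
print(restored.index.tolist())   # ['row1', 'row2', 'row3']
```

`assert_frame_equal` also accepts options such as `check_dtype=False` for cases where a dtype change is acceptable.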
Handling Type Information
# Automatic type reconstruction
original_df = pd.DataFrame({
    'dates': pd.date_range('2024-01-01', periods=3),
    'categories': pd.Categorical(['A', 'B', 'A']),
    'numbers': [1, 2, 3],
    'floats': [1.1, 2.2, 3.3]
})
# Serialize and deserialize
result = datason.serialize(original_df)
restored_df = datason.deserialize(result)
# All dtypes preserved automatically
print(original_df.dtypes)
print(restored_df.dtypes)
# Output should be identical
Advanced Use Cases
Mixed Data Types
import numpy as np
# DataFrame with complex mixed types
complex_df = pd.DataFrame({
    'id': range(3),
    'timestamp': pd.date_range('2024-01-01', periods=3),
    'data': [{'key': 'value'}, [1, 2, 3], 'simple_string'],
    'numbers': np.array([1.1, 2.2, 3.3]),
    'categories': pd.Categorical(['type1', 'type2', 'type1'])
})
result = datason.serialize(complex_df)
# All types handled appropriately
Sparse DataFrames
# Sparse data handling
sparse_df = pd.DataFrame({
    'A': pd.arrays.SparseArray([0, 0, 1, 0, 1]),
    'B': pd.arrays.SparseArray([0, 1, 0, 0, 0]),
    'C': range(5)
})
result = datason.serialize(sparse_df)
# Sparse arrays converted to dense for JSON compatibility
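If you want to control the conversion yourself, sparse columns can be densified before serialization. The sketch below shows one way to do this in pandas; `densify` is a hypothetical helper and not necessarily how datason performs the conversion internally.

```python
import pandas as pd

sparse_df = pd.DataFrame({
    "A": pd.arrays.SparseArray([0, 0, 1, 0, 1]),
    "B": pd.arrays.SparseArray([0, 1, 0, 0, 0]),
    "C": range(5),
})


def densify(df):
    """Return a copy of `df` with every sparse column converted to a dense one."""
    dense = df.copy()
    for name in dense.columns:
        # Only sparse columns expose the .sparse accessor, so check the dtype first
        if isinstance(dense[name].dtype, pd.SparseDtype):
            dense[name] = dense[name].sparse.to_dense()
    return dense


dense_df = densify(sparse_df)

print(dense_df["A"].tolist())   # [0, 0, 1, 0, 1]
print(any(isinstance(t, pd.SparseDtype) for t in dense_df.dtypes))   # False
```

The columns are handled one at a time because the DataFrame-level `.sparse` accessor requires every column to be sparse, which is not the case here (`C` is dense).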
DataFrame with Custom Objects
# DataFrame containing custom objects
class CustomObject:
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"CustomObject({self.value})"

df = pd.DataFrame({
    'id': [1, 2, 3],
    'custom': [CustomObject(10), CustomObject(20), CustomObject(30)]
})
result = datason.serialize(df)
# Custom objects handled with fallback serialization
Configuration Examples
API Response Format
# Optimized for API responses
config = datason.get_api_config()
api_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'created_at': pd.date_range('2024-01-01', periods=3),
    'score': [85.5, 92.1, 78.3]
})
result = datason.serialize(api_df, config=config)
# Uses records orientation with ISO date format
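Database Export Format
# DateFormat is assumed to be importable from datason like the other enums
from datason import SerializationConfig, DataFrameOrient, DateFormat, NanHandling
# Optimized for database import
config = SerializationConfig(
    dataframe_orient=DataFrameOrient.SPLIT,
    date_format=DateFormat.UNIX,
    nan_handling=NanHandling.NULL
)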
result = datason.serialize(df, config=config)
# Efficient format with numeric timestamps
Analytics Format
# Optimized for analytics tools (analytics_df stands for any DataFrame you want to export)
config = SerializationConfig(
    dataframe_orient=DataFrameOrient.COLUMNS,
    preserve_categorical_ordering=True,
    include_dtype_info=True
)
result = datason.serialize(analytics_df, config=config)
# Column-wise data with full type preservation
Best Practices
1. Choose the Right Orientation
# For APIs with row-based processing
config = SerializationConfig(dataframe_orient=DataFrameOrient.RECORDS)
# For large datasets and performance
config = SerializationConfig(dataframe_orient=DataFrameOrient.SPLIT)
# For column-wise analytics
config = SerializationConfig(dataframe_orient=DataFrameOrient.COLUMNS)
2. Handle Missing Data Consistently
# Be explicit about NaN handling
config = SerializationConfig(nan_handling=NanHandling.NULL)
# Ensures consistent behavior across different data sources
3. Preserve Important Metadata
# Include index information when important
df_with_meaningful_index = pd.DataFrame(
    {'values': [1, 2, 3]},
    index=['row1', 'row2', 'row3']
)
# Index will be preserved automatically
4. Optimize Data Types
# Use appropriate data types for efficiency
df = df.astype({
    'category_column': 'category',
    'small_int_column': 'int8',
    'date_column': 'datetime64[ns]'
})
Related Features
- Configuration System - DataFrame-specific configurations
- Date/Time Handling - Temporal data in DataFrames
- Advanced Types - Complex data within DataFrames
- Performance - Large DataFrame optimization
Next Steps
- Configuration - Customize DataFrame serialization
- Performance - Optimize for large datasets
- ML/AI Integration - Use DataFrames in ML workflows