# File Operations Guide
Datason provides comprehensive file operations for saving and loading data in both JSON and JSONL formats, fully integrated with its other features: ML optimization, security, streaming, and compression.
## Quick Start
```python
import datason
import numpy as np
import pandas as pd

# Save ML data to JSONL
ml_data = {
    "model_weights": np.random.randn(100, 50),
    "training_data": pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}),
    "metrics": {"accuracy": 0.95},
}

# Save with ML optimization (compressed JSONL)
datason.save_ml(ml_data, "model.jsonl.gz")

# Load back with smart reconstruction
loaded = list(datason.load_smart_file("model.jsonl.gz"))
```
## Formats Supported
### JSON Format (.json)
- Best for: Single objects, configuration files, metadata
- Structure: Single JSON object or array per file
- Use when: You have a complete dataset as one unit
```python
# Save single object
config = {"learning_rate": 0.001, "batch_size": 32}
datason.save_ml(config, "config.json")

# Save array of experiments
experiments = [{"id": 1, "acc": 0.9}, {"id": 2, "acc": 0.85}]
datason.save_ml(experiments, "experiments.json")
```
### JSONL Format (.jsonl)
- Best for: Streaming data, logs, large datasets, line-by-line processing
- Structure: One JSON object per line
- Use when: You want to stream, append, or process data incrementally
```python
# Save to JSONL (each item on a separate line)
training_logs = [
    {"epoch": 1, "loss": 0.8, "weights": np.random.randn(10)},
    {"epoch": 2, "loss": 0.6, "weights": np.random.randn(10)},
    {"epoch": 3, "loss": 0.4, "weights": np.random.randn(10)},
]
datason.save_ml(training_logs, "training.jsonl")
```
## Core Functions
### Save Functions
#### `save_ml(data, filepath, format=None)`
Optimized for ML workflows with automatic type preservation.
```python
# Automatic format detection from extension
datason.save_ml(ml_data, "model.jsonl")     # JSONL format
datason.save_ml(ml_data, "model.json")      # JSON format
datason.save_ml(ml_data, "model.jsonl.gz")  # Compressed JSONL

# Explicit format override
datason.save_ml(ml_data, "model.txt", format="jsonl")
```
#### `save_secure(data, filepath, redact_pii=False, redact_fields=None, format=None)`
Save with security features: PII redaction and data protection.
```python
sensitive_data = {
    "user": {
        "name": "John Doe",
        "email": "john@company.com",
        "ssn": "123-45-6789",
    },
    "api_key": "sk-1234567890",
}

# Automatic PII detection and redaction
datason.save_secure(
    sensitive_data,
    "secure_data.jsonl",
    redact_pii=True,
    redact_fields=["api_key"],
)
```
#### `save_api(data, filepath, format=None)`
Save clean API-style data without ML-specific optimizations.
```python
api_response = {"status": "success", "data": [1, 2, 3]}
datason.save_api(api_response, "api_data.json")
```
#### `save_chunked(data, filepath, chunk_size=1000, format=None)`
Save large datasets in chunks for memory efficiency.
```python
large_dataset = [{"record": i} for i in range(100000)]
datason.save_chunked(large_dataset, "large.jsonl", chunk_size=5000)
```
### Load Functions
#### `load_smart_file(filepath, format=None)`
Intelligent loading with automatic type reconstruction.
```python
# Load with smart type detection
data = list(datason.load_smart_file("model.jsonl"))

# Force a specific format
data = list(datason.load_smart_file("data.txt", format="jsonl"))
```
#### `load_perfect_file(filepath, template, format=None)`
Perfect type reconstruction using templates.
```python
# Define a template for perfect reconstruction
# (df is assumed to be an existing DataFrame with the target schema)
template = {
    "weights": np.array([[0.0]]),  # NumPy array template
    "dataframe": df.iloc[:1],      # DataFrame template
    "metadata": {},                # Regular dict
}

# Load with perfect type reconstruction
data = list(datason.load_perfect_file("model.jsonl", template))
```
#### `load_basic_file(filepath, format=None)`
Basic loading without type reconstruction.
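A minimal sketch, reusing the model.jsonl file from the `save_ml` example (values come back as plain JSON-compatible objects rather than reconstructed NumPy arrays or DataFrames):

```python
# Raw records without type reconstruction
raw_records = list(datason.load_basic_file("model.jsonl"))
print(type(raw_records[0]))  # a plain dict
```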
### Streaming Functions
#### `stream_save_ml(filepath, format=None)`
Stream data for memory-efficient processing of large datasets.
```python
# Stream training checkpoints
# (model.get_weights() and get_epoch_metrics() are placeholders for your own code)
with datason.stream_save_ml("checkpoints.jsonl") as stream:
    for epoch in range(1000):
        checkpoint = {
            "epoch": epoch,
            "weights": model.get_weights(),
            "metrics": get_epoch_metrics(epoch),
        }
        stream.write(checkpoint)
```
## Advanced Features
### Compression
Automatic compression with the `.gz` extension:
```python
# These are automatically compressed
datason.save_ml(large_data, "data.jsonl.gz")
datason.save_ml(large_data, "data.json.gz")

# Load compressed files transparently
data = list(datason.load_smart_file("data.jsonl.gz"))
```
### Format Conversion
Easy conversion between JSON and JSONL:
```python
# Load JSONL and save as JSON
jsonl_data = list(datason.load_smart_file("experiments.jsonl"))
datason.save_ml(jsonl_data, "experiments.json", format="json")

# Load JSON and save as JSONL
json_data = list(datason.load_smart_file("config.json"))
datason.save_ml(json_data, "config.jsonl", format="jsonl")
```
### Security and Redaction
Automatic PII detection and redaction:
```python
customer_data = {
    "customers": [
        {
            "name": "Alice Johnson",
            "email": "alice@company.com",
            "ssn": "123-45-6789",
            "credit_card": "4532-1234-5678-9012",
        }
    ],
    "secrets": {
        "api_key": "sk-secret123",
        "database_url": "postgresql://user:pass@db/prod",
    },
}

# Comprehensive redaction
datason.save_secure(
    customer_data,
    "customers.jsonl",
    redact_pii=True,                            # Auto-detect SSN, credit cards, etc.
    redact_fields=["api_key", "database_url"],  # Explicit field redaction
)

# Load back with redaction metadata
secure_data = list(datason.load_smart_file("customers.jsonl"))[0]
print(f"Redacted {secure_data['redaction_summary']['total_redactions']} items")
```
## ML Workflows
### Model Training Pipeline
```python
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

import datason

# 1. Prepare training data
X, y = make_classification(n_samples=1000, n_features=20)
feature_info = pd.DataFrame({
    "name": [f"feature_{i}" for i in range(20)],
    "importance": np.random.random(20),
})

# 2. Train model
model = RandomForestClassifier()
model.fit(X, y)

# 3. Package everything
ml_package = {
    "model": model,
    "training_data": {"X": X, "y": y, "features": feature_info},
    "metadata": {"accuracy": model.score(X, y), "timestamp": datetime.now()},
}

# 4. Save with compression
datason.save_ml(ml_package, "trained_model.jsonl.gz")
```
### Streaming Training Logs
```python
from datetime import datetime

# Stream training progress
# (model, train_loader, val_loader, scheduler, train_one_epoch and validate
#  are placeholders for your own training setup)
with datason.stream_save_ml("training_log.jsonl") as stream:
    for epoch in range(100):
        # Training step
        train_loss = train_one_epoch(model, train_loader)
        val_loss = validate(model, val_loader)

        # Log progress
        log_entry = {
            "epoch": epoch,
            "train_loss": train_loss,
            "val_loss": val_loss,
            "learning_rate": scheduler.get_lr()[0],
            "timestamp": datetime.now(),
            "model_weights_sample": model.layers[0].weight.data[:5].numpy(),
        }
        stream.write(log_entry)

# Later: analyze training progress
logs = list(datason.load_smart_file("training_log.jsonl"))
train_losses = [log["train_loss"] for log in logs]
```
### Experiment Tracking
```python
# Track multiple experiments
# (train_model, evaluate_model and test_data are placeholders for your own code)
experiments = []
for lr in [0.01, 0.001, 0.0001]:
    for batch_size in [32, 64, 128]:
        # Run experiment
        config = {"lr": lr, "batch_size": batch_size}
        model = train_model(config)
        results = evaluate_model(model, test_data)

        experiments.append({
            "config": config,
            "results": results,
            "model_weights": model.get_weights(),
            "timestamp": datetime.now(),
        })

# Save all experiments
datason.save_ml(experiments, "experiments.jsonl.gz")

# Later: find the best experiment
all_experiments = list(datason.load_smart_file("experiments.jsonl.gz"))
best = max(all_experiments, key=lambda x: x["results"]["accuracy"])
```
## Data Types Supported
### NumPy Arrays
Perfect preservation of shape, dtype, and data:
```python
data = {
    "weights": np.random.randn(100, 50).astype(np.float32),
    "labels": np.random.randint(0, 10, 1000).astype(np.int64),
    "mask": np.random.choice([True, False], 1000),
}

datason.save_ml(data, "arrays.jsonl")
loaded = list(datason.load_smart_file("arrays.jsonl"))[0]

assert isinstance(loaded["weights"], np.ndarray)
assert loaded["weights"].dtype == np.float32
assert loaded["weights"].shape == (100, 50)
```
### Pandas DataFrames
Complete preservation including dtypes and index:
```python
df = pd.DataFrame({
    "id": range(1000),
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="H"),
    "value": np.random.random(1000),
    "category": pd.Categorical(np.random.choice(["A", "B", "C"], 1000)),
})

datason.save_ml({"dataframe": df}, "dataframe.jsonl")
loaded = list(datason.load_smart_file("dataframe.jsonl"))[0]

assert isinstance(loaded["dataframe"], pd.DataFrame)
assert len(loaded["dataframe"]) == 1000
assert "timestamp" in loaded["dataframe"].columns
```
### Complex Nested Structures
Handles arbitrarily nested data:
```python
complex_data = {
    "neural_net": {
        "layers": [
            {"type": "dense", "weights": np.random.randn(784, 128)},
            {"type": "dense", "weights": np.random.randn(128, 10)},
        ],
        "optimizer_state": {
            "momentum": [np.random.randn(784, 128), np.random.randn(128, 10)],
            "learning_rate": 0.001,
        },
    },
    "training_history": pd.DataFrame({
        "epoch": range(50),
        "loss": np.random.exponential(1, 50),
    }),
}
```
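Saving and loading a structure like this works the same way as in the earlier examples (a sketch; the filename is arbitrary):

```python
datason.save_ml(complex_data, "complex_model.jsonl.gz")

restored = list(datason.load_smart_file("complex_model.jsonl.gz"))[0]
assert isinstance(restored["training_history"], pd.DataFrame)
assert isinstance(restored["neural_net"]["layers"][0]["weights"], np.ndarray)
```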
## Performance Tips
### When to Use Each Format
| Use Case | Recommended Format | Reason |
|---|---|---|
| Model checkpoints | `.jsonl.gz` | Good compression, line-by-line structure |
| Configuration files | `.json` | Single object, human readable |
| Training logs | `.jsonl` | Streamable, appendable |
| Large datasets | `.jsonl.gz` | Memory efficient, compressed |
| API responses | `.json` | Standard format |
| Experiment results | `.jsonl` | One experiment per line |
### Memory Efficiency
For large datasets, use streaming:
```python
# Instead of loading everything into memory ...
big_data = [huge_record for huge_record in massive_dataset]  # ❌ Memory intensive
datason.save_ml(big_data, "huge.jsonl")

# ... use streaming instead
with datason.stream_save_ml("huge.jsonl") as stream:  # ✅ Memory efficient
    for record in massive_dataset:
        stream.write(record)
```
### Compression Benefits
Compression ratios vary by data type:
- Text data: 5-10x compression
- Repeated values: 20-50x compression
- Numerical arrays: 2-5x compression
- Mixed data: 3-8x compression
```python
# Always use compression for large files
datason.save_ml(large_data, "data.jsonl.gz")  # ✅ Compressed
datason.save_ml(large_data, "data.jsonl")     # ❌ Uncompressed
```
## Best Practices
### File Naming Conventions
```python
# Good naming patterns
"model_v1.2.3.jsonl.gz"           # Versioned model
"training_logs_2024-01-15.jsonl"  # Dated logs
"experiments_lr_search.jsonl"     # Descriptive purpose
"customer_data_secure.jsonl"      # Security indication

# Use consistent extensions
".jsonl" or ".jsonl.gz"  # For line-by-line data
".json" or ".json.gz"    # For single objects
```
### Directory Structure
```
ml_project/
├── models/
│   ├── checkpoints/
│   │   ├── epoch_001.jsonl.gz
│   │   ├── epoch_002.jsonl.gz
│   │   └── final_model.jsonl.gz
│   └── experiments/
│       ├── hyperparameter_search.jsonl
│       └── architecture_comparison.jsonl
├── data/
│   ├── training_data.jsonl.gz
│   ├── validation_data.jsonl.gz
│   └── features.json
└── logs/
    ├── training_2024-01-15.jsonl
    └── evaluation_2024-01-15.jsonl
```
### Error Handling
```python
import json

import datason

try:
    data = list(datason.load_smart_file("model.jsonl"))
except FileNotFoundError:
    print("Model file not found")
except json.JSONDecodeError:
    print("Corrupted file")
except Exception as e:
    print(f"Unexpected error: {e}")
```
## API Reference
### Function Signatures
```python
# Save functions
save_ml(data, filepath, format=None)
save_secure(data, filepath, redact_pii=False, redact_fields=None, format=None)
save_api(data, filepath, format=None)
save_chunked(data, filepath, chunk_size=1000, format=None)

# Load functions
load_smart_file(filepath, format=None) -> Iterator
load_perfect_file(filepath, template, format=None) -> Iterator
load_basic_file(filepath, format=None) -> Iterator

# Streaming
stream_save_ml(filepath, format=None) -> ContextManager
```
### Parameters
- `data`: Data to save (any serializable object)
- `filepath`: Path to save/load (str or Path object)
- `format`: Force format (`"json"` or `"jsonl"`; auto-detected if `None`)
- `redact_pii`: Enable automatic PII detection and redaction
- `redact_fields`: List of field names to redact
- `template`: Template object for perfect type reconstruction
- `chunk_size`: Number of records per chunk for memory efficiency
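To illustrate how these parameters combine (a sketch; the artifacts/ path and record contents are made up for the example):

```python
from pathlib import Path

import datason

records = [{"id": i, "value": i * 0.5} for i in range(2000)]

# filepath accepts either a str or a pathlib.Path
out_path = Path("artifacts") / "records.dat"
out_path.parent.mkdir(parents=True, exist_ok=True)

# format overrides extension-based detection; chunk_size controls memory use
datason.save_chunked(records, out_path, chunk_size=500, format="jsonl")

# The same format override applies when loading from a non-standard extension
loaded = list(datason.load_smart_file(out_path, format="jsonl"))
```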
## Integration with Other Features
File operations integrate seamlessly with all datason features:
- ✅ ML Type Handlers: Automatic preservation of ML objects
- ✅ Security & Redaction: PII protection and field redaction
- ✅ Streaming: Memory-efficient processing
- ✅ Compression: Automatic `.gz` detection
- ✅ Templates: Perfect type reconstruction
- ✅ Progressive Complexity: basic/smart/perfect loading options (see the sketch below)
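As a quick sketch of those progressive loading tiers (the tiers_demo.jsonl file and its contents are made up for this example):

```python
import numpy as np

import datason

record = {"weights": np.arange(6.0).reshape(2, 3), "metrics": {"accuracy": 0.9}}
datason.save_ml(record, "tiers_demo.jsonl")

# Basic: raw values, no type reconstruction
raw = list(datason.load_basic_file("tiers_demo.jsonl"))

# Smart: automatic type detection and reconstruction
smart = list(datason.load_smart_file("tiers_demo.jsonl"))

# Perfect: reconstruction guided by a template
template = {"weights": np.array([[0.0]]), "metrics": {}}
perfect = list(datason.load_perfect_file("tiers_demo.jsonl", template))
```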
This makes file operations a true first-class citizen in the datason ecosystem!