File Operations Guide

Datason provides comprehensive file operations for saving and loading data in both JSON and JSONL formats, fully integrated with its ML optimization, security and redaction, streaming, and compression features.

Quick Start

import datason
import numpy as np
import pandas as pd

# Save ML data to JSONL
ml_data = {
    "model_weights": np.random.randn(100, 50),
    "training_data": pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}),
    "metrics": {"accuracy": 0.95}
}

# Save with ML optimization
datason.save_ml(ml_data, "model.jsonl.gz")  # Compressed JSONL

# Load back with smart reconstruction
loaded = list(datason.load_smart_file("model.jsonl.gz"))

Formats Supported

JSON Format (.json)

  • Best for: Single objects, configuration files, metadata
  • Structure: Single JSON object or array per file
  • Use when: You have a complete dataset as one unit
# Save single object
config = {"learning_rate": 0.001, "batch_size": 32}
datason.save_ml(config, "config.json")

# Save array of experiments  
experiments = [{"id": 1, "acc": 0.9}, {"id": 2, "acc": 0.85}]
datason.save_ml(experiments, "experiments.json")

JSONL Format (.jsonl)

  • Best for: Streaming data, logs, large datasets, line-by-line processing
  • Structure: One JSON object per line
  • Use when: You want to stream, append, or process data incrementally
# Save to JSONL (each item on separate line)
training_logs = [
    {"epoch": 1, "loss": 0.8, "weights": np.random.randn(10)},
    {"epoch": 2, "loss": 0.6, "weights": np.random.randn(10)},
    {"epoch": 3, "loss": 0.4, "weights": np.random.randn(10)}
]
datason.save_ml(training_logs, "training.jsonl")

Core Functions

Save Functions

save_ml(data, filepath, format=None)

Optimized for ML workflows with automatic type preservation.

# Automatic format detection from extension
datason.save_ml(ml_data, "model.jsonl")     # JSONL format
datason.save_ml(ml_data, "model.json")      # JSON format
datason.save_ml(ml_data, "model.jsonl.gz")  # Compressed JSONL

# Explicit format override
datason.save_ml(ml_data, "model.txt", format="jsonl")

save_secure(data, filepath, redact_pii=False, redact_fields=None, format=None)

Save with security features: PII redaction and data protection.

sensitive_data = {
    "user": {
        "name": "John Doe",
        "email": "john@company.com",
        "ssn": "123-45-6789"
    },
    "api_key": "sk-1234567890"
}

# Automatic PII detection and redaction
datason.save_secure(
    sensitive_data,
    "secure_data.jsonl",
    redact_pii=True,
    redact_fields=["api_key"]
)

save_api(data, filepath, format=None)

Save clean, API-friendly data without ML-specific optimizations.

api_response = {"status": "success", "data": [1, 2, 3]}
datason.save_api(api_response, "api_data.json")
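
API payloads usually need no type reconstruction, so basic loading is enough; a small sketch, assuming the single JSON object is yielded as one item by the loader:

loaded = list(datason.load_basic_file("api_data.json"))[0]
print(loaded["status"])  # "success"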

save_chunked(data, filepath, chunk_size=1000, format=None)

Save large datasets in chunks for memory efficiency.

large_dataset = [{"record": i} for i in range(100000)]
datason.save_chunked(large_dataset, "large.jsonl", chunk_size=5000)

Load Functions

load_smart_file(filepath, format=None)

Intelligent loading with automatic type reconstruction.

# Load with smart type detection
data = list(datason.load_smart_file("model.jsonl"))

# Force specific format
data = list(datason.load_smart_file("data.txt", format="jsonl"))

load_perfect_file(filepath, template, format=None)

Perfect type reconstruction using templates.

# Define template for perfect reconstruction
template = {
    "weights": np.array([[0.0]]),           # NumPy array template
    "dataframe": df.iloc[:1],               # DataFrame template  
    "metadata": {}                          # Regular dict
}

# Load with perfect type reconstruction
data = list(datason.load_perfect_file("model.jsonl", template))

load_basic_file(filepath, format=None)

Basic loading without type reconstruction.

# Load as basic Python types only
data = list(datason.load_basic_file("data.jsonl"))
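
Basic loading is handy when you only need plain Python structures, for example to feed the data into a generic JSON pipeline. A minimal sketch of the difference between basic and smart loading, assuming basic loading leaves serialized arrays as plain lists while smart loading reconstructs them:

import numpy as np
import datason

datason.save_ml({"weights": np.zeros(3)}, "weights.jsonl")

basic = list(datason.load_basic_file("weights.jsonl"))[0]
smart = list(datason.load_smart_file("weights.jsonl"))[0]

print(type(basic["weights"]))  # plain list (assumption)
print(type(smart["weights"]))  # numpy.ndarray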

Streaming Functions

stream_save_ml(filepath, format=None)

Stream data for memory-efficient processing of large datasets.

# Stream training checkpoints
with datason.stream_save_ml("checkpoints.jsonl") as stream:
    for epoch in range(1000):
        checkpoint = {
            "epoch": epoch,
            "weights": model.get_weights(),
            "metrics": get_epoch_metrics(epoch)
        }
        stream.write(checkpoint)

Advanced Features

Compression

Automatic compression with .gz extension:

# These are automatically compressed
datason.save_ml(large_data, "data.jsonl.gz")
datason.save_ml(large_data, "data.json.gz")

# Load compressed files transparently  
data = list(datason.load_smart_file("data.jsonl.gz"))

Format Conversion

Easy conversion between JSON and JSONL:

# Load JSONL and save as JSON
jsonl_data = list(datason.load_smart_file("experiments.jsonl"))
datason.save_ml(jsonl_data, "experiments.json", format="json")

# Load JSON and save as JSONL  
json_data = list(datason.load_smart_file("config.json"))
datason.save_ml(json_data, "config.jsonl", format="jsonl")

Security and Redaction

Automatic PII detection and redaction:

customer_data = {
    "customers": [
        {
            "name": "Alice Johnson",
            "email": "alice@company.com",
            "ssn": "123-45-6789",
            "credit_card": "4532-1234-5678-9012"
        }
    ],
    "secrets": {
        "api_key": "sk-secret123",
        "database_url": "postgresql://user:pass@db/prod"
    }
}

# Comprehensive redaction
datason.save_secure(
    customer_data,
    "customers.jsonl",
    redact_pii=True,                    # Auto-detect SSN, credit cards, etc.
    redact_fields=["api_key", "database_url"]  # Explicit field redaction
)

# Load back with redaction metadata
secure_data = list(datason.load_smart_file("customers.jsonl"))[0]
print(f"Redacted {secure_data['redaction_summary']['total_redactions']} items")

ML Workflows

Model Training Pipeline

from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

import datason

# 1. Prepare training data
X, y = make_classification(n_samples=1000, n_features=20)
feature_info = pd.DataFrame({
    "name": [f"feature_{i}" for i in range(20)],
    "importance": np.random.random(20)
})

# 2. Train model
model = RandomForestClassifier()
model.fit(X, y)

# 3. Package everything
ml_package = {
    "model": model,
    "training_data": {"X": X, "y": y, "features": feature_info},
    "metadata": {"accuracy": model.score(X, y), "timestamp": datetime.now()}
}

# 4. Save with compression
datason.save_ml(ml_package, "trained_model.jsonl.gz")

Streaming Training Logs

# Stream training progress
with datason.stream_save_ml("training_log.jsonl") as stream:
    for epoch in range(100):
        # Training step
        train_loss = train_one_epoch(model, train_loader)
        val_loss = validate(model, val_loader)

        # Log progress
        log_entry = {
            "epoch": epoch,
            "train_loss": train_loss,
            "val_loss": val_loss,
            "learning_rate": scheduler.get_lr()[0],
            "timestamp": datetime.now(),
            "model_weights_sample": model.layers[0].weight.data[:5].numpy()
        }
        stream.write(log_entry)

# Later: analyze training progress
logs = list(datason.load_smart_file("training_log.jsonl"))
train_losses = [log["train_loss"] for log in logs]

Experiment Tracking

# Track multiple experiments
experiments = []

for lr in [0.01, 0.001, 0.0001]:
    for batch_size in [32, 64, 128]:
        # Run experiment
        config = {"lr": lr, "batch_size": batch_size}
        model = train_model(config)
        results = evaluate_model(model, test_data)

        experiments.append({
            "config": config,
            "results": results,
            "model_weights": model.get_weights(),
            "timestamp": datetime.now()
        })

# Save all experiments
datason.save_ml(experiments, "experiments.jsonl.gz")

# Later: find best experiment
all_experiments = list(datason.load_smart_file("experiments.jsonl.gz"))
best = max(all_experiments, key=lambda x: x["results"]["accuracy"])

Data Types Supported

NumPy Arrays

Perfect preservation of shape, dtype, and data:

data = {
    "weights": np.random.randn(100, 50).astype(np.float32),
    "labels": np.random.randint(0, 10, 1000).astype(np.int64),
    "mask": np.random.choice([True, False], 1000)
}
datason.save_ml(data, "arrays.jsonl")
loaded = list(datason.load_smart_file("arrays.jsonl"))[0]

assert isinstance(loaded["weights"], np.ndarray)
assert loaded["weights"].dtype == np.float32
assert loaded["weights"].shape == (100, 50)

Pandas DataFrames

Complete preservation including dtypes and index:

df = pd.DataFrame({
    "id": range(1000),
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="H"),
    "value": np.random.random(1000),
    "category": pd.Categorical(np.random.choice(["A", "B", "C"], 1000))
})

datason.save_ml({"dataframe": df}, "dataframe.jsonl")
loaded = list(datason.load_smart_file("dataframe.jsonl"))[0]

assert isinstance(loaded["dataframe"], pd.DataFrame)
assert len(loaded["dataframe"]) == 1000
assert "timestamp" in loaded["dataframe"].columns

Complex Nested Structures

Handles arbitrarily nested data:

complex_data = {
    "neural_net": {
        "layers": [
            {"type": "dense", "weights": np.random.randn(784, 128)},
            {"type": "dense", "weights": np.random.randn(128, 10)}
        ],
        "optimizer_state": {
            "momentum": [np.random.randn(784, 128), np.random.randn(128, 10)],
            "learning_rate": 0.001
        }
    },
    "training_history": pd.DataFrame({
        "epoch": range(50),
        "loss": np.random.exponential(1, 50)
    })
}
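
The nested structure above round-trips with the same calls as flat data; a short sketch:

datason.save_ml(complex_data, "nested.jsonl.gz")
loaded = list(datason.load_smart_file("nested.jsonl.gz"))[0]

assert isinstance(loaded["training_history"], pd.DataFrame)
assert loaded["neural_net"]["layers"][0]["weights"].shape == (784, 128)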

Performance Tips

When to Use Each Format

Use Case              Recommended Format   Reason
Model checkpoints     .jsonl.gz            Good compression, line-by-line structure
Configuration files   .json                Single object, human readable
Training logs         .jsonl               Streamable, appendable
Large datasets        .jsonl.gz            Memory efficient, compressed
API responses         .json                Standard format
Experiment results    .jsonl               One experiment per line

Memory Efficiency

For large datasets, use streaming:

# Instead of loading everything into memory
big_data = [huge_record for huge_record in massive_dataset]  # ❌ Memory intensive
datason.save_ml(big_data, "huge.jsonl")

# Use streaming instead
with datason.stream_save_ml("huge.jsonl") as stream:      # ✅ Memory efficient
    for record in massive_dataset:
        stream.write(record)

Compression Benefits

Compression ratios vary by data type:

  • Text data: 5-10x compression
  • Repeated values: 20-50x compression
  • Numerical arrays: 2-5x compression
  • Mixed data: 3-8x compression
# Always use compression for large files
datason.save_ml(large_data, "data.jsonl.gz")  # ✅ Compressed
datason.save_ml(large_data, "data.jsonl")     # ❌ Uncompressed
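
To see what compression buys you on your own data, compare the resulting file sizes; a minimal sketch using only the standard library (the exact ratio depends on your data):

import os

datason.save_ml(large_data, "data.jsonl")
datason.save_ml(large_data, "data.jsonl.gz")

raw_size = os.path.getsize("data.jsonl")
gz_size = os.path.getsize("data.jsonl.gz")
print(f"Compression ratio: {raw_size / gz_size:.1f}x")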

Best Practices

File Naming Conventions

# Good naming patterns
"model_v1.2.3.jsonl.gz"           # Versioned model
"training_logs_2024-01-15.jsonl"  # Dated logs  
"experiments_lr_search.jsonl"     # Descriptive purpose
"customer_data_secure.jsonl"      # Security indication

# Use consistent extensions
".jsonl" or ".jsonl.gz"           # For line-by-line data
".json" or ".json.gz"             # For single objects

Directory Structure

ml_project/
├── models/
│   ├── checkpoints/
│   │   ├── epoch_001.jsonl.gz
│   │   ├── epoch_002.jsonl.gz
│   │   └── final_model.jsonl.gz
│   └── experiments/
│       ├── hyperparameter_search.jsonl
│       └── architecture_comparison.jsonl
├── data/
│   ├── training_data.jsonl.gz
│   ├── validation_data.jsonl.gz
│   └── features.json
└── logs/
    ├── training_2024-01-15.jsonl
    └── evaluation_2024-01-15.jsonl
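
A minimal sketch of writing checkpoints into this layout with pathlib (the paths mirror the tree above; the checkpoint contents are placeholders):

from pathlib import Path
import datason

checkpoint_dir = Path("ml_project/models/checkpoints")
checkpoint_dir.mkdir(parents=True, exist_ok=True)

for epoch in range(1, 3):
    checkpoint = {"epoch": epoch, "weights": model.get_weights()}  # placeholder model
    datason.save_ml(checkpoint, checkpoint_dir / f"epoch_{epoch:03d}.jsonl.gz")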

Error Handling

import json

import datason

try:
    data = list(datason.load_smart_file("model.jsonl"))
except FileNotFoundError:
    print("Model file not found")
except json.JSONDecodeError:
    print("Corrupted file")
except Exception as e:
    print(f"Unexpected error: {e}")

API Reference

Function Signatures

# Save functions
save_ml(data, filepath, format=None)
save_secure(data, filepath, redact_pii=False, redact_fields=None, format=None)  
save_api(data, filepath, format=None)
save_chunked(data, filepath, chunk_size=1000, format=None)

# Load functions  
load_smart_file(filepath, format=None) -> Iterator
load_perfect_file(filepath, template, format=None) -> Iterator
load_basic_file(filepath, format=None) -> Iterator

# Streaming
stream_save_ml(filepath, format=None) -> ContextManager

Parameters

  • data: Data to save (any serializable object)
  • filepath: Path to save/load (str or Path object)
  • format: Force format ("json" or "jsonl", auto-detected if None)
  • redact_pii: Enable automatic PII detection and redaction
  • redact_fields: List of field names to redact
  • template: Template object for perfect type reconstruction
  • chunk_size: Number of records per chunk for memory efficiency
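
The parameters combine freely. For example, a sketch that forces JSONL for a non-standard extension while redacting an explicit field (records is a placeholder for your data):

datason.save_secure(
    records,
    "export.dat",
    redact_pii=True,
    redact_fields=["api_key"],
    format="jsonl",
)

loaded = list(datason.load_smart_file("export.dat", format="jsonl"))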

Integration with Other Features

File operations integrate seamlessly with all datason features:

  • ✅ ML Type Handlers: Automatic preservation of ML objects
  • ✅ Security & Redaction: PII protection and field redaction
  • ✅ Streaming: Memory-efficient processing
  • ✅ Compression: Automatic .gz detection
  • ✅ Templates: Perfect type reconstruction
  • ✅ Progressive Complexity: basic/smart/perfect loading options

This makes file operations a true first-class citizen in the datason ecosystem!