Performance Optimization
Tips for optimizing performance when processing large datasets.
Parallel Processing
For processing thousands of profiles, enable parallel processing to use multiple CPU cores:
from mpspline import mpspline
# Enable parallel processing
df = mpspline(
    profiles,
    parallel=True,   # Enable multiprocessing
    n_workers=4,     # Use 4 CPU cores (default: all available)
    batch_size=200,  # Process 200 profiles per batch
)
Performance characteristics:
Single-threaded: ~100-500 profiles/second (depends on profile complexity)
Parallel (4 workers): ~400-2000 profiles/second (3-4x speedup)
Parallel (8 workers): ~800-4000 profiles/second (6-8x speedup)
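When setting n_workers explicitly, it can help to cap it by both the core count and the amount of work. The following is a minimal sketch, not part of mpspline: pick_n_workers is a hypothetical helper, and the defaults of one reserved core and 100 profiles per worker are illustrative assumptions rather than measured thresholds.

```python
import os

def pick_n_workers(n_profiles, reserve_cores=1, min_profiles_per_worker=100):
    """Suggest a worker count (hypothetical helper, not part of mpspline)."""
    available = os.cpu_count() or 1           # cpu_count() can return None
    usable = max(1, available - reserve_cores)
    # Avoid starting more workers than there is work to share out.
    # The 100-profiles-per-worker default is an assumed, untuned threshold.
    by_workload = max(1, n_profiles // min_profiles_per_worker)
    return min(usable, by_workload)
```

The result could then be passed as n_workers=pick_n_workers(len(profiles)). For small inputs it falls back to a single worker, where process start-up cost would likely outweigh any speedup.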
Batch Size Tuning
The batch_size parameter sets how many profiles are processed per batch, trading memory use against speed:
# Large batches (faster, more memory)
df = mpspline(profiles, batch_size=1000)
# Small batches (slower, less memory)
df = mpspline(profiles, batch_size=50)
Recommendations:
Large memory (> 8GB): Use batch_size=500-1000
Medium memory (4-8GB): Use batch_size=100-300
Limited memory (< 4GB): Use batch_size=50-100
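The recommendations above can be encoded as a small lookup. This is only a sketch: suggest_batch_size is a hypothetical helper, not an mpspline function, and the memory figure has to be supplied by the caller (for example from the third-party psutil package via psutil.virtual_memory().available).

```python
def suggest_batch_size(available_gb):
    """Return a (low, high) batch_size range for the given memory in GB.

    Hypothetical helper that mirrors the three tiers listed above.
    """
    if available_gb > 8:
        return (500, 1000)   # Large memory
    if available_gb >= 4:
        return (100, 300)    # Medium memory
    return (50, 100)         # Limited memory

low, high = suggest_batch_size(6)
print(f"Try batch_size between {low} and {high}")
```

Starting at the low end of the range and increasing while memory stays comfortable is a reasonable way to use the result.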
Reducing Computation
Process only the properties you need:
# ✓ Good: Process only clay
result = mpspline(profile, var_name=['clay'])
# ✗ Slower: Process all numeric fields
result = mpspline(profile)
Profile Complexity Impact
Processing time increases with:
Number of horizons in a profile
Number of properties being harmonized
Depth interval count
Example:
# Fast: 3 horizons, 1 property, standard depths
result = mpspline(profile, var_name=['clay'])
# Slower: 20 horizons, 5 properties, 20 custom depths
result = mpspline(
    profile,
    var_name=['clay', 'sand', 'silt', 'om', 'bdensity'],
    target_depths=[(0, 5), (5, 10), (10, 15), ...]  # 20 intervals
)
Memory Usage
Memory usage depends on:
Dataset size: Number of profiles × number of properties
Batch size: Larger batches = more memory
Number of workers: More workers = more copies of data
Example memory usage (single worker):
1,000 profiles × 5 properties: ~5-10 MB
100,000 profiles × 5 properties: ~500-1000 MB
1,000,000 profiles × 5 properties: ~5-10 GB
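With parallel processing (for example, 8 workers):
Multiply the single-worker figure by the number of workers, then add about 20% overhead. For example, 100,000 profiles × 5 properties on 8 workers needs roughly 4.8-9.6 GB.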
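The figures above can be turned into a rough estimator. This sketch assumes memory scales linearly with profiles × properties (about 1-2 KB per pair, derived from the 1,000-profile row) and reads the parallel rule as "single-worker usage × workers × 1.2". estimate_memory_mb is a hypothetical helper, and real usage will vary with profile complexity.

```python
def estimate_memory_mb(n_profiles, n_properties, n_workers=1):
    """Return a rough (low, high) memory estimate in MB.

    Hypothetical helper based on the figures in this section.
    """
    # 1,000 profiles x 5 properties ~ 5-10 MB, so each
    # profile-property pair costs about 1/1000 to 1/500 MB.
    pairs = n_profiles * n_properties
    low, high = pairs / 1000, pairs / 500
    if n_workers > 1:
        # Assumed reading of the rule: one copy per worker plus 20% overhead.
        factor = n_workers * 1.2
        low, high = low * factor, high * factor
    return low, high

low, high = estimate_memory_mb(100_000, 5, n_workers=8)
print(f"Estimated memory: {low:.0f}-{high:.0f} MB")
```

Comparing the upper bound with available RAM before a large run is a quick way to decide whether to reduce n_workers or batch_size.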
Smoothing Parameter Impact
The lam parameter has little effect on computation time, though it can change the number of iterations needed:
Default (lam=0.1): Balanced fit, ~1-5 iterations
Light smoothing (lam=0.01): May need more iterations, but still fast
Heavy smoothing (lam=1.0): May converge faster
In practice, the difference is negligible for performance.
Benchmarks
Sample run times on a 2022 laptop (Intel i5, 8GB RAM):
Single profile processing:
1 profile, 1 property: ~5-10 ms
1 profile, 5 properties: ~25-50 ms
1 profile, 10 properties: ~50-100 ms
Batch processing (1,000 profiles, 5 properties):
Single-threaded: ~2-5 seconds
4 workers: ~0.5-1.5 seconds (3-4x speedup)
8 workers: ~0.3-0.8 seconds (6-8x speedup)
Large batch (100,000 profiles, 1 property):
Single-threaded: ~30-60 seconds
4 workers: ~8-20 seconds
8 workers: ~5-10 seconds
Quick Performance Checklist
For best performance with large datasets:
✓ Enable parallel processing: parallel=True
✓ Use an appropriate batch size for your memory (typically 100-500)
✓ Process only needed properties: var_name=['clay', 'sand']
✓ Consider smoothing parameter based on data quality
✓ Use a machine with multiple CPU cores
✓ Monitor memory usage with large datasets
Example optimized code:
from mpspline import mpspline
# Optimized for performance
df = mpspline(
    profiles,
    var_name=['clay'],  # Only clay
    parallel=True,      # Use all CPU cores
    batch_size=500,     # Balanced batch size
    n_workers=None,     # Auto-detect CPU count
)
Profiling
To profile performance of your specific dataset:
import time
from mpspline import mpspline
# perf_counter() is a monotonic, high-resolution clock suited to timing code
start = time.perf_counter()
df = mpspline(profiles, parallel=True, batch_size=500)
elapsed = time.perf_counter() - start
profiles_per_second = len(profiles) / elapsed
print(f"Processed {len(profiles)} profiles in {elapsed:.1f}s")
print(f"Speed: {profiles_per_second:.0f} profiles/second")
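To compare several settings, the timing pattern above can be wrapped in a reusable function. This is a generic sketch: time_call is a hypothetical helper, not part of mpspline, and taking the best of a few repeats is one common way to reduce noise from other processes.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Possible usage with mpspline, with parameters as used in this guide:
# for size in (50, 200, 500, 1000):
#     t = time_call(mpspline, profiles, parallel=True, batch_size=size)
#     print(f"batch_size={size}: {len(profiles) / t:.0f} profiles/second")
```

Running the loop on a representative sample of the data, rather than the full dataset, keeps the comparison quick.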