Performance Optimization

Tips for optimizing performance when processing large datasets.

Parallel Processing

For processing thousands of profiles, enable parallel processing to use multiple CPU cores:

from mpspline import mpspline

# Enable parallel processing
df = mpspline(
    profiles,
    parallel=True,      # Enable multiprocessing
    n_workers=4,        # Use 4 CPU cores (default: all available)
    batch_size=200,     # Process 200 profiles per batch
)
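
If you would rather not hand every core to the job, one option is to derive n_workers from the machine's core count. The sketch below uses the standard-library os.cpu_count(). The helper name pick_workers and the choice to leave one core free are illustrative assumptions, not part of mpspline.

```python
import os


def pick_workers(reserve=1):
    """Return a worker count that leaves `reserve` cores free (hypothetical helper)."""
    total = os.cpu_count() or 1  # cpu_count() can return None on some platforms
    return max(1, total - reserve)  # never go below one worker


n_workers = pick_workers()
print(f"Using {n_workers} worker(s)")
# Then pass it along, e.g.:
#   df = mpspline(profiles, parallel=True, n_workers=n_workers)
```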

Performance characteristics:

  • Single-threaded: ~100-500 profiles/second (depends on profile complexity)

  • Parallel (4 cores): ~400-2000 profiles/second (3-4x speedup)

  • Parallel (8 cores): ~800-4000 profiles/second (6-8x speedup)

Batch Size Tuning

The batch_size parameter sets how many profiles are processed per batch, trading memory for speed:

# Large batches (faster, more memory)
df = mpspline(profiles, batch_size=1000)

# Small batches (slower, less memory)
df = mpspline(profiles, batch_size=50)

Recommendations:

  • Large memory (> 8 GB): Use batch_size=500-1000

  • Medium memory (4-8 GB): Use batch_size=100-300

  • Limited memory (< 4 GB): Use batch_size=50-100
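
The recommendations above can be encoded as a small lookup. This is a minimal sketch: suggest_batch_size is a hypothetical helper, and the available memory is passed in by hand rather than detected.

```python
def suggest_batch_size(mem_gb):
    """Map available memory (GB) to the recommended (low, high) batch_size range."""
    if mem_gb > 8:        # large memory
        return (500, 1000)
    if mem_gb >= 4:       # medium memory (4-8 GB)
        return (100, 300)
    return (50, 100)      # limited memory (< 4 GB)


low, high = suggest_batch_size(6)
print(f"Try batch_size between {low} and {high}")
# e.g.: df = mpspline(profiles, batch_size=low)
```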

Reducing Computation

Process only the properties you need:

# ✓ Good: Process only clay
result = mpspline(profile, var_name=['clay'])

# ✗ Slower: Process all numeric fields
result = mpspline(profile)

Profile Complexity Impact

Processing time increases with:

  1. Number of horizons in a profile

  2. Number of properties being harmonized

  3. Number of target depth intervals

Example:

# Fast: 3 horizons, 1 property, standard depths
result = mpspline(profile, var_name=['clay'])

# Slower: 20 horizons, 5 properties, 20 custom depths
result = mpspline(
    profile,
    var_name=['clay', 'sand', 'silt', 'om', 'bdensity'],
    target_depths=[(0, 5), (5, 10), (10, 15), ...]  # 20 intervals
)
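
When experimenting with how the number of depth intervals affects run time, it can help to generate contiguous intervals programmatically instead of typing them out. The helper below is a hypothetical sketch. The 0-100 range and 5-unit step are assumptions chosen only to yield 20 intervals, and are not the list elided in the example above.

```python
def make_depth_intervals(start, stop, step):
    """Build contiguous (top, bottom) depth intervals covering [start, stop]."""
    if step <= 0:
        raise ValueError("step must be positive")
    # The last interval is clipped so it never extends past `stop`.
    return [(top, min(top + step, stop)) for top in range(start, stop, step)]


intervals = make_depth_intervals(0, 100, 5)
print(len(intervals))   # 20
print(intervals[:3])    # [(0, 5), (5, 10), (10, 15)]
```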

Memory Usage

Memory usage depends on:

  • Dataset size: Number of profiles × number of properties

  • Batch size: Larger batches = more memory

  • Number of workers: More workers = more copies of data

Example memory usage (single worker):

  • 1,000 profiles × 5 properties: ~5-10 MB

  • 100,000 profiles × 5 properties: ~500-1000 MB

  • 1,000,000 profiles × 5 properties: ~5-10 GB

With parallel processing (8 workers):

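Multiply the single-worker estimate by the number of workers, then add about 20% overhead. For example, 100,000 profiles × 5 properties with 8 workers needs roughly 500-1000 MB × 8 × 1.2 ≈ 4.8-9.6 GB.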
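
The figures above work out to roughly 1-2 KB per profile-property pair, so the rule of thumb can be turned into a rough estimator. This is a back-of-the-envelope sketch derived only from the numbers in this section: estimate_memory_mb is a hypothetical helper, and real usage will vary with profile complexity.

```python
KB_PER_VALUE = (1.0, 2.0)  # low/high KB per profile-property, from the figures above
OVERHEAD = 1.2             # 20% overhead applied when running in parallel


def estimate_memory_mb(n_profiles, n_properties, n_workers=1):
    """Return a (low, high) memory estimate in MB."""
    cells = n_profiles * n_properties
    factor = n_workers * OVERHEAD if n_workers > 1 else 1.0
    return tuple(cells * kb / 1000 * factor for kb in KB_PER_VALUE)


low, high = estimate_memory_mb(100_000, 5, n_workers=8)
print(f"{low:.0f}-{high:.0f} MB")  # 4800-9600 MB
```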

Smoothing Parameter Impact

The lam parameter has little effect on computation time, though it can change how many iterations the fit needs:

  • Default (lam=0.1): Balanced fit, ~1-5 iterations

  • Low smoothing (lam=0.01): May need more iterations, but still fast

  • High smoothing (lam=1.0): May converge faster

In practice, the difference is negligible for performance.

Benchmarks

Sample run times on a 2022 laptop (Intel i5, 8 GB RAM):

Single profile processing:

  • 1 profile, 1 property: ~5-10 ms

  • 1 profile, 5 properties: ~25-50 ms

  • 1 profile, 10 properties: ~50-100 ms

Batch processing (1,000 profiles, 5 properties):

  • Single-threaded: ~2-5 seconds

  • 4 workers: ~0.5-1.5 seconds (3-4x speedup)

  • 8 workers: ~0.3-0.8 seconds (6-8x speedup)

Large batch (100,000 profiles, 1 property):

  • Single-threaded: ~30-60 seconds

  • 4 workers: ~8-20 seconds

  • 8 workers: ~5-10 seconds
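
To compare your own timings against these figures, it helps to express them as speedup and parallel efficiency (speedup divided by worker count). The helper below is an illustrative sketch. The sample inputs are the lower bounds of the large-batch figures above.

```python
def speedup_and_efficiency(serial_s, parallel_s, n_workers):
    """Return (speedup, efficiency) for a parallel run versus a serial one."""
    speedup = serial_s / parallel_s
    return speedup, speedup / n_workers


# 100,000 profiles, 1 property: 30 s single-threaded vs 5 s with 8 workers
s, e = speedup_and_efficiency(30, 5, 8)
print(f"{s:.1f}x speedup, {e:.0%} efficiency")  # 6.0x speedup, 75% efficiency
```

Efficiency well below 100% usually means per-worker overhead (process startup, data copying) is eating into the gains, which is a cue to try a larger batch_size or fewer workers.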

Quick Performance Checklist

For best performance with large datasets:

  1. ✓ Enable parallel processing: parallel=True

  2. ✓ Use a batch size appropriate for your memory (typically 100-500)

  3. ✓ Process only needed properties: var_name=['clay', 'sand']

  4. ✓ Choose the smoothing parameter (lam) for data quality, not speed

  5. ✓ Use a machine with multiple CPU cores

  6. ✓ Monitor memory usage with large datasets

Example optimized code:

from mpspline import mpspline

# Optimized for performance
df = mpspline(
    profiles,
    var_name=['clay'],           # Only clay
    parallel=True,               # Use all CPU cores
    batch_size=500,              # Balanced batch size
    n_workers=None,              # Auto-detect CPU count
)

Profiling

To measure performance on your own dataset:

import time
from mpspline import mpspline

start = time.perf_counter()  # monotonic, high-resolution timer
df = mpspline(profiles, parallel=True, batch_size=500)
elapsed = time.perf_counter() - start

profiles_per_second = len(profiles) / elapsed
print(f"Processed {len(profiles)} profiles in {elapsed:.1f}s")
print(f"Speed: {profiles_per_second:.0f} profiles/second")
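
To find a good batch_size empirically, the same timing pattern can be repeated over several candidate values. The harness below is a sketch under stated assumptions: sweep_batch_sizes and fake_run are hypothetical names, and fake_run is a stand-in workload so the example runs without mpspline installed.

```python
import time


def sweep_batch_sizes(run, profiles, batch_sizes):
    """Time run(profiles, batch_size) per batch size; return profiles/second."""
    results = {}
    for size in batch_sizes:
        start = time.perf_counter()
        run(profiles, size)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on very fast runs.
        results[size] = len(profiles) / elapsed if elapsed > 0 else float("inf")
    return results


def fake_run(profiles, batch_size):
    """Stand-in workload: sums the data one batch at a time."""
    for i in range(0, len(profiles), batch_size):
        sum(profiles[i:i + batch_size])


rates = sweep_batch_sizes(fake_run, list(range(10_000)), [50, 200, 500])
for size, rate in rates.items():
    print(f"batch_size={size}: {rate:.0f} profiles/second")
```

With the real library, the callable would wrap the call shown above, for example run = lambda p, b: mpspline(p, parallel=True, batch_size=b).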