
SCIENTIFIC COMPUTING PATTERN

Cache-or-Run Pipeline
Complete Guide with Examples

Built step by step with SHA-256 hashing, a CacheManager class, and a Python decorator.

Contents

  1. Why do we need this?
  2. Step 1 — Naive version and its fatal bug
  3. Step 2 — Hash-based caching (the fix)
  4. How SHA-256 hashing works internally
  5. Step 3 — CacheManager class
  6. Step 4 — @cacheable decorator
  7. Cache invalidation and versioning
Section 01

Why do we need this?

In scientific computing you often have a slow computation — a Monte Carlo simulation, a radiative transfer run, a full OE retrieval over 1000 scenes. This can take minutes. During development you run the same notebook dozens of times. Without caching, you pay the full cost every single time.

The cache-or-run pattern solves this with one runtime decision:

Call function: run_monte_carlo(snr=300)
  → Result on disk? (check hash key)
      YES → cache HIT:  load in 0.002s
      NO  → cache MISS: compute + save
Key principle: a pure function — same inputs always produce same outputs — can be stored permanently. There is no reason to ever recompute it unless the inputs change.
Section 02

Step 1 — Naive version and its fatal bug

The simplest approach: just check if a single fixed file exists.

step1_naive.py Python
import os
import numpy as np

CACHE_FILE = 'cache/result.npz'

def cache_or_run_naive(snr, n_scenes):
    if os.path.exists(CACHE_FILE):
        # Load from disk -- but is this actually the right result?
        data = np.load(CACHE_FILE, allow_pickle=True)
        return data['result'].item()
    else:
        # Compute and save
        result = expensive_computation(snr, n_scenes)
        np.savez(CACHE_FILE, result=result)
        return result
Fatal bug: call with snr=300 — saves the result. Then call with snr=500 — finds the file and returns the snr=300 result silently, with no error, no warning. Wrong result every time after the first.

RUN 1 -- snr=300 --> [CACHE MISS] precision=0.00326 2.05s

RUN 2 -- snr=300 --> [CACHE HIT ] precision=0.00326 0.00s OK

RUN 3 -- snr=500 --> [CACHE HIT ] precision=0.00326 0.00s BUG -- snr=300 result returned!

The fix: make the cache filename depend on the inputs. If inputs change, the filename changes, the old file is not found, and recomputation happens automatically. This is done with hashing.

Section 03

Step 2 — Hash-based caching (the fix)

Instead of one fixed filename, compute a unique fingerprint of all input parameters and use that as the filename. Different inputs → different fingerprint → different file.

step2_hashing.py Python
import os, hashlib, json
import numpy as np

def make_cache_key(params: dict) -> str:
    canonical = json.dumps(params, sort_keys=True)   # deterministic order
    hash_hex  = hashlib.sha256(canonical.encode()).hexdigest()
    return hash_hex[:16]                              # first 16 chars

def cache_or_run(snr, n_scenes, seed=42):
    params = {'snr': snr, 'n_scenes': n_scenes,
              'seed': seed, 'version': 1}

    key        = make_cache_key(params)           # e.g. 'c66f1de772e87e1f'
    cache_path = f'cache/result_{key}.npz'       # unique filename per input set

    if os.path.exists(cache_path):
        data = np.load(cache_path, allow_pickle=True)
        return data['result'].item()
    else:
        result = expensive_computation(snr, n_scenes, seed)
        np.savez(cache_path, result=result)
        return result
RUN 1 snr=300 key=c66f1de772e87e1f --> [CACHE MISS] 2.04s

RUN 2 snr=300 key=c66f1de772e87e1f --> [CACHE HIT ] 0.00s correct

RUN 3 snr=500 key=9285575b7463bb49 --> [CACHE MISS] 2.00s different hash!

RUN 4 snr=500 key=9285575b7463bb49 --> [CACHE HIT ] 0.00s correct

Each unique combination of inputs gets its own cache file. Changing any single parameter produces a different hash and therefore a different filename — the old cache is never accidentally returned.

Section 04

How SHA-256 hashing works internally

SHA-256 is a cryptographic hash function. It takes any input and produces a fixed 64-hex-character output called a digest. Three properties make it ideal for caching:

Property          | Meaning                                        | Why it matters for caching
Deterministic     | Same input always gives same hash              | Same params → same filename → cache hit
Avalanche effect  | Tiny input change → completely different hash  | snr=300 vs snr=301 → different files, no collision
Fixed output size | Always 64 hex chars regardless of input size   | Filenames are always short and uniform
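The three properties can be checked directly in a few lines. This is a quick sanity check, not from the original pipeline; `digest` is a throwaway helper:

```python
import hashlib

def digest(s: str) -> str:
    """Hex SHA-256 digest of a string."""
    return hashlib.sha256(s.encode()).hexdigest()

# Deterministic: same input, same digest -- every time, on every machine
assert digest('snr=300') == digest('snr=300')

# Avalanche: a one-character change gives a completely different digest
assert digest('snr=300') != digest('snr=301')

# Fixed size: always 64 hex chars, whether the input is 1 byte or 10 kB
assert len(digest('x')) == len(digest('x' * 10_000)) == 64
```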

Step-by-step: params dict → cache key

internal steps Python
# Step 1: dict --> canonical JSON string (sort_keys is critical!)
params    = {'snr': 300, 'n_scenes': 100, 'seed': 42}
canonical = json.dumps(params, sort_keys=True)
# result: '{"n_scenes": 100, "seed": 42, "snr": 300}'

# Step 2: string --> bytes
raw_bytes = canonical.encode()
# result: b'{"n_scenes": 100, "seed": 42, "snr": 300}'

# Step 3: bytes --> SHA-256 digest (64 hex chars)
full_hash = hashlib.sha256(raw_bytes).hexdigest()
# result: 'c66f1de772e87e1f3a8b9c2d...'  (64 chars total)

# Step 4: take first 16 chars (sufficient for uniqueness)
key = full_hash[:16]
# result: 'c66f1de772e87e1f'

# WHY sort_keys=True?
# {'snr':300,'n':100} and {'n':100,'snr':300} are identical dicts
# but produce different JSON strings without sorting!
# sort_keys ensures both always produce the same canonical string.
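The claim in the comment above can be verified directly (a short sanity check added for illustration):

```python
import hashlib, json

a = {'snr': 300, 'n': 100}
b = {'n': 100, 'snr': 300}       # same dict, different insertion order
assert a == b

# Without sort_keys the JSON strings differ -> different hashes
assert json.dumps(a) != json.dumps(b)

# With sort_keys both dicts canonicalise to the same string -> same key
ca = json.dumps(a, sort_keys=True)
cb = json.dumps(b, sort_keys=True)
assert ca == cb == '{"n": 100, "snr": 300}'

key_a = hashlib.sha256(ca.encode()).hexdigest()[:16]
key_b = hashlib.sha256(cb.encode()).hexdigest()[:16]
assert key_a == key_b            # identical dicts -> identical cache key
```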

Avalanche effect — one changed value, completely different hash

PARAMETERS                      SHA-256 KEY (first 16 chars)
snr=300, n=100, seed=42, v=1    c66f1de772e87e1f
snr=500, n=100, seed=42, v=1    9285575b7463bb49
snr=300, n=500, seed=42, v=1    6fac88a9a7f15806
snr=300, n=100, seed=99, v=1    016ca99823aed73b
snr=300, n=100, seed=42, v=2    c8a86fc5d05a0dc8

Every single change — even bumping the version by 1 — produces a completely different 16-character key. This is the avalanche effect guaranteeing no accidental cache collisions.

Section 05

Step 3 — CacheManager class

In a real pipeline, wrap all the logic into a class that also stores metadata — when the result was computed, how long it took, and what parameters produced it. This makes the cache auditable and manageable.

CacheManager — full implementation Python
import os, json, time, hashlib
from datetime import datetime
import numpy as np

class CacheManager:

    def __init__(self, cache_dir='cache', verbose=True):
        self.cache_dir = cache_dir
        self.verbose   = verbose
        os.makedirs(cache_dir, exist_ok=True)

    def _make_key(self, params: dict) -> str:
        canonical = json.dumps(params, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    def _path(self, key: str) -> str:
        return os.path.join(self.cache_dir, f'{key}.npz')

    def run_or_load(self, params, compute_fn, *args, **kwargs):
        """Core cache-or-run logic."""
        key  = self._make_key(params)
        path = self._path(key)

        if os.path.exists(path):                       # cache HIT
            data = np.load(path, allow_pickle=True)
            if self.verbose:
                meta = data['meta'].item()
                print(f"[HIT] key={key}  was computed on {meta['computed_at']}")
            return data['result'].item()

        if self.verbose:                               # cache MISS
            print(f"[MISS] key={key}  computing...")
        t0     = time.time()
        result = compute_fn(*args, **kwargs)
        dur    = time.time() - t0
        meta   = {'key':         key,
                  'computed_at': datetime.now().isoformat(),
                  'duration_s':  round(dur, 3),
                  'params':      params}
        np.savez(path, result=result, meta=meta)
        return result

    def invalidate(self, params: dict):
        """Delete the cache file for a specific set of params."""
        path = self._path(self._make_key(params))
        if os.path.exists(path):
            os.remove(path)
            print(f"[INVALIDATED] {path}")

    def clear_all(self):
        """Delete all .npz files in the cache directory."""
        for f in os.listdir(self.cache_dir):
            if f.endswith('.npz'):
                os.remove(os.path.join(self.cache_dir, f))

    def list_cache(self):
        """Print a table of all cached results with their metadata."""
        for f in os.listdir(self.cache_dir):
            if not f.endswith('.npz'):      # skip any non-cache files
                continue
            data = np.load(os.path.join(self.cache_dir, f), allow_pickle=True)
            m    = data['meta'].item()
            print(m['key'], m['computed_at'], m['duration_s'], m['params'])
Key              Computed at          Duration  Params
535049cbf0ad6e34 2026-03-31 07:08:26  2.00s     {"snr": 500, "n_scenes": 1000 ...}
6bf98e1fe60a10c9 2026-03-31 07:08:29  3.00s     {"snr_list": [100, 200, ...] ...}
75163166b943ab9c 2026-03-31 07:08:24  2.02s     {"snr": 300, "n_scenes": 1000 ...}
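Usage looks like this. The sketch below carries a trimmed-down copy of the class (no metadata or verbose printing, just enough to run standalone); `fake_retrieval` and the call counter are made-up names for the demo:

```python
import os, json, hashlib, tempfile
import numpy as np

# Trimmed-down copy of the CacheManager above, just enough to demo usage.
class CacheManager:
    def __init__(self, cache_dir='cache'):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _make_key(self, params):
        canonical = json.dumps(params, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    def _path(self, key):
        return os.path.join(self.cache_dir, f'{key}.npz')

    def run_or_load(self, params, compute_fn, *args, **kwargs):
        path = self._path(self._make_key(params))
        if os.path.exists(path):                         # cache HIT
            return np.load(path, allow_pickle=True)['result'].item()
        result = compute_fn(*args, **kwargs)             # cache MISS
        np.savez(path, result=result)
        return result

calls = []
def fake_retrieval(snr):
    calls.append(snr)                  # count how often the body really runs
    return {'precision': 1.0 / snr}

cache  = CacheManager(cache_dir=tempfile.mkdtemp())
params = {'snr': 300, 'version': 1}
r1 = cache.run_or_load(params, fake_retrieval, 300)   # MISS: computes + saves
r2 = cache.run_or_load(params, fake_retrieval, 300)   # HIT: loads from disk
print(len(calls))   # 1 -- the expensive body ran exactly once
```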
Section 06

Step 4 — @cacheable decorator

A Python decorator adds caching to any function with one line, without touching the function body. This is the cleanest production pattern for a scientific notebook or pipeline.

decorator definition Python
import functools, inspect

_default_cache = CacheManager()          # shared module-level cache instance

def cacheable(version=1, cache=_default_cache):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Introspect: get all argument names and their values
            sig   = inspect.signature(fn)
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()               # include default args too
            params = dict(bound.arguments)
            # {'snr': 300, 'n_scenes': 1000, 'seed': 42}

            params['__fn__']      = fn.__name__  # avoid cross-function collisions
            params['__version__'] = version

            # Pass params and compute_fn positionally: mixing them as
            # keywords with *args would raise "multiple values for argument".
            return cache.run_or_load(params, fn, *args, **kwargs)
        return wrapper
    return decorator

Usage — add one line above any slow function

applying the decorator Python
@cacheable(version=1)                 # <-- this one line adds full caching
def run_monte_carlo(snr, n_scenes, seed=42):
    """Slow Monte Carlo CO2 retrieval validation. Function unchanged."""
    np.random.seed(seed)
    errors = np.random.normal(0, 1.0/snr, size=n_scenes)
    return {'errors': errors,
            'precision': float(np.std(errors)),
            'bias':      float(np.mean(errors))}

r = run_monte_carlo(snr=300, n_scenes=1000)  # first call:  2.05s
r = run_monte_carlo(snr=300, n_scenes=1000)  # second call: 0.002s
r = run_monte_carlo(snr=500, n_scenes=1000)  # new snr:     2.00s (new hash)
r = run_monte_carlo(snr=500, n_scenes=1000)  # again:       0.002s

How inspect works here:

inspect.signature(fn) reads the parameter names at definition time.

sig.bind(*args, **kwargs) maps the actual call values to those names.

apply_defaults() fills in any missing optional arguments. The result is a dict like

{'snr': 300, 'n_scenes': 1000, 'seed': 42}

— exactly what gets hashed. This means even if you call the function positionally or with keywords, the same inputs always produce the same cache key.
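The normalisation is easy to see in isolation (a standalone check; the empty `run_monte_carlo` stub here exists only to give `inspect` a signature to read):

```python
import inspect

def run_monte_carlo(snr, n_scenes, seed=42):
    pass                                   # body irrelevant: only the signature matters

sig = inspect.signature(run_monte_carlo)

# Positional call
b1 = sig.bind(300, 1000)
b1.apply_defaults()                        # fills in seed=42

# Keyword call, arguments in a different order
b2 = sig.bind(n_scenes=1000, snr=300)
b2.apply_defaults()

print(dict(b1.arguments))   # {'snr': 300, 'n_scenes': 1000, 'seed': 42}
assert dict(b1.arguments) == dict(b2.arguments)   # same dict -> same cache key
```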

Section 07

Cache invalidation and versioning

Sometimes you need to force recomputation even when inputs have not changed — for example after fixing a bug in the algorithm. There are three strategies:

Situation                           | Strategy                    | Code
Fixed a bug in the computation      | Bump version in decorator   | @cacheable(version=2)
One specific result is stale        | Selective invalidation      | cache.invalidate({'snr': 300, ...})
Wipe everything and restart         | Clear all                   | cache.clear_all()
External data updated (HITRAN, NWP) | Add data version to params  | 'hitran_ver': '2020'
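The first and last strategies work purely through the hash: any version field that changes produces a different key, so old files are simply never looked up again. A sketch using the `make_cache_key` helper from Step 2 (the `'2024'` HITRAN label is a hypothetical placeholder):

```python
import hashlib, json

def make_cache_key(params):
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Bumping the algorithm version orphans all version=1 files on disk
v1 = make_cache_key({'snr': 300, 'n_scenes': 1000, 'version': 1})
v2 = make_cache_key({'snr': 300, 'n_scenes': 1000, 'version': 2})
assert v1 != v2        # version=1 results are never returned again

# Same idea for external data: pin its version into the params
old = make_cache_key({'snr': 300, 'hitran_ver': '2020'})
new = make_cache_key({'snr': 300, 'hitran_ver': '2024'})
assert old != new      # updating the database forces recomputation
```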
version bump forces global recomputation Python
# Before finding the bug -- all results cached at version=1
@cacheable(version=1)
def run_monte_carlo(snr, n_scenes): ...

# After fixing a bug in the forward model, bump the version:
@cacheable(version=2)   # all version=1 hashes are now different
def run_monte_carlo(snr, n_scenes): ...

# On next call -- old .npz files remain on disk but are never found
# because the hash now includes version=2 instead of version=1.
# All results are recomputed automatically. Zero manual cleanup needed.
Rule of thumb: always include a version key in your params. Treat it like a migration number in a database. Bump it whenever you change the algorithm, fix a bug, or update any upstream dependency — spectroscopic database version, NWP reanalysis vintage, instrument calibration version, etc.