This section builds up the cache-or-run pattern in three stages: SHA-256 hashing of input parameters, a CacheManager class, and a Python decorator.
In scientific computing you often have a slow computation — a Monte Carlo simulation, a radiative transfer run, a full OE retrieval over 1000 scenes. This can take minutes. During development you run the same notebook dozens of times. Without caching, you pay the full cost every single time.
The cache-or-run pattern solves this with one runtime decision: if a result for these inputs already exists on disk, load it; otherwise compute it, save it, and return it.
The simplest approach: just check if a single fixed file exists.
```python
import os
import numpy as np

CACHE_FILE = 'cache/result.npz'

def cache_or_run_naive(snr, n_scenes):
    if os.path.exists(CACHE_FILE):
        # Load from disk -- but is this actually the right result?
        data = np.load(CACHE_FILE, allow_pickle=True)
        return data['result'].item()
    else:
        # Compute and save
        result = expensive_computation(snr, n_scenes)
        np.savez(CACHE_FILE, result=result)
        return result
```

The flaw: the filename never changes, so after you change `snr` or `n_scenes` the stale file is loaded anyway, silently returning a result computed for different inputs.
The fix: make the cache filename depend on the inputs. If inputs change, the filename changes, the old file is not found, and recomputation happens automatically. This is done with hashing.
Instead of one fixed filename, compute a unique fingerprint of all input parameters and use that as the filename. Different inputs → different fingerprint → different file.
```python
import hashlib
import json
import os
import numpy as np

def make_cache_key(params: dict) -> str:
    canonical = json.dumps(params, sort_keys=True)  # deterministic order
    hash_hex = hashlib.sha256(canonical.encode()).hexdigest()
    return hash_hex[:16]  # first 16 chars

def cache_or_run(snr, n_scenes, seed=42):
    params = {'snr': snr, 'n_scenes': n_scenes, 'seed': seed, 'version': 1}
    key = make_cache_key(params)            # e.g. 'c66f1de772e87e1f'
    cache_path = f'cache/result_{key}.npz'  # unique filename per input set
    if os.path.exists(cache_path):
        data = np.load(cache_path, allow_pickle=True)
        return data['result'].item()
    else:
        result = expensive_computation(snr, n_scenes, seed)
        np.savez(cache_path, result=result)
        return result
```
Each unique combination of inputs gets its own cache file. Changing any single parameter produces a different hash and therefore a different filename — the old cache is never accidentally returned.
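A quick sanity check of both guarantees, written as a standalone sketch of the `make_cache_key` recipe above: any single-parameter change produces a new key, while dict insertion order does not matter.

```python
import hashlib
import json

def make_cache_key(params: dict) -> str:
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

a = make_cache_key({'snr': 300, 'n_scenes': 100, 'seed': 42})
b = make_cache_key({'snr': 301, 'n_scenes': 100, 'seed': 42})  # snr changed by 1
c = make_cache_key({'seed': 42, 'snr': 300, 'n_scenes': 100})  # same params, new order

print(a != b)  # True: any parameter change produces a different key
print(a == c)  # True: sort_keys makes insertion order irrelevant
```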
SHA-256 is a cryptographic hash function. It takes any input and produces a fixed 64-hex-character output called a digest. Three properties make it ideal for caching:
| Property | Meaning | Why it matters for caching |
|---|---|---|
| Deterministic | Same input always gives same hash | Same params → same filename → cache hit |
| Avalanche effect | Tiny input change → completely different hash | snr=300 vs snr=301 → different files, no collision |
| Fixed output size | Always 64 hex chars regardless of input size | Filenames are always short and uniform |
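All three properties can be verified directly with the standard library; this is a standalone check, not part of the pipeline code.

```python
import hashlib

h1 = hashlib.sha256(b'snr=300').hexdigest()
h2 = hashlib.sha256(b'snr=301').hexdigest()  # one character changed

print(len(h1), len(h2))  # 64 64 -- fixed output size, regardless of input
print(h1 == hashlib.sha256(b'snr=300').hexdigest())  # True -- deterministic
print(h1[:16], h2[:16])  # completely different digests -- avalanche effect
```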
```python
import hashlib
import json

# Step 1: dict --> canonical JSON string (sort_keys is critical!)
params = {'snr': 300, 'n_scenes': 100, 'seed': 42}
canonical = json.dumps(params, sort_keys=True)
# result: '{"n_scenes": 100, "seed": 42, "snr": 300}'

# Step 2: string --> bytes
raw_bytes = canonical.encode()
# result: b'{"n_scenes": 100, "seed": 42, "snr": 300}'

# Step 3: bytes --> SHA-256 digest (64 hex chars)
full_hash = hashlib.sha256(raw_bytes).hexdigest()

# Step 4: take first 16 chars (sufficient for uniqueness)
key = full_hash[:16]

# WHY sort_keys=True?
# {'snr': 300, 'n': 100} and {'n': 100, 'snr': 300} are identical dicts
# but produce different JSON strings without sorting!
# sort_keys ensures both always produce the same canonical string.
```
Every single change — even bumping the version by 1 — produces a completely different 16-character key. This is the avalanche effect guaranteeing no accidental cache collisions.
In a real pipeline, wrap all the logic into a class that also stores metadata — when the result was computed, how long it took, and what parameters produced it. This makes the cache auditable and manageable.
```python
import hashlib
import json
import os
import time
from datetime import datetime

import numpy as np

class CacheManager:
    def __init__(self, cache_dir='cache', verbose=True):
        self.cache_dir = cache_dir
        self.verbose = verbose
        os.makedirs(cache_dir, exist_ok=True)

    def _make_key(self, params: dict) -> str:
        canonical = json.dumps(params, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode()).hexdigest()[:16]

    def _path(self, key: str) -> str:
        return os.path.join(self.cache_dir, f'{key}.npz')

    def run_or_load(self, params, compute_fn, *args, **kwargs):
        """Core cache-or-run logic."""
        key = self._make_key(params)
        path = self._path(key)
        if os.path.exists(path):  # cache HIT
            data = np.load(path, allow_pickle=True)
            if self.verbose:
                meta = data['meta'].item()
                print(f"[HIT] key={key} was computed on {meta['computed_at']}")
            return data['result'].item()
        if self.verbose:  # cache MISS
            print(f"[MISS] key={key} computing...")
        t0 = time.time()
        result = compute_fn(*args, **kwargs)
        dur = time.time() - t0
        meta = {'key': key,
                'computed_at': datetime.now().isoformat(),
                'duration_s': round(dur, 3),
                'params': params}
        np.savez(path, result=result, meta=meta)
        return result

    def invalidate(self, params: dict):
        """Delete the cache file for a specific set of params."""
        path = self._path(self._make_key(params))
        if os.path.exists(path):
            os.remove(path)
            print(f"[INVALIDATED] {path}")

    def clear_all(self):
        """Delete all .npz files in the cache directory."""
        for f in os.listdir(self.cache_dir):
            if f.endswith('.npz'):
                os.remove(os.path.join(self.cache_dir, f))

    def list_cache(self):
        """Print a table of all cached results with their metadata."""
        for f in os.listdir(self.cache_dir):
            data = np.load(os.path.join(self.cache_dir, f), allow_pickle=True)
            m = data['meta'].item()
            print(m['key'], m['computed_at'], m['duration_s'], m['params'])
```
| Key | Computed at | Duration | Params |
|---|---|---|---|
| 535049cbf0ad6e34 | 2026-03-31 07:08:26 | 2.00s | {"snr": 500, "n_scenes": 1000 ...} |
| 6bf98e1fe60a10c9 | 2026-03-31 07:08:29 | 3.00s | {"snr_list": [100, 200, ...] ...} |
| 75163166b943ab9c | 2026-03-31 07:08:24 | 2.02s | {"snr": 300, "n_scenes": 1000 ...} |
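To see the HIT/MISS flow end to end without the full class, here is a condensed, self-contained sketch of the same idea. The JSON storage, the toy `slow_square` function, and the temp directory are all demo assumptions, not part of the CacheManager above.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime

class MiniCache:
    """Stripped-down CacheManager: same key scheme, JSON storage."""
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, params):
        key = hashlib.sha256(
            json.dumps(params, sort_keys=True, default=str).encode()
        ).hexdigest()[:16]
        return os.path.join(self.cache_dir, f'{key}.json')

    def run_or_load(self, params, compute_fn, *args, **kwargs):
        path = self._path(params)
        if os.path.exists(path):              # HIT: skip the computation
            with open(path) as f:
                return json.load(f)['result']
        result = compute_fn(*args, **kwargs)  # MISS: compute and store
        with open(path, 'w') as f:
            json.dump({'result': result,
                       'computed_at': datetime.now().isoformat()}, f)
        return result

calls = []
def slow_square(x):
    calls.append(x)  # track how often we actually compute
    return x * x

cache = MiniCache(tempfile.mkdtemp())
r1 = cache.run_or_load({'x': 7}, slow_square, 7)  # MISS: computes
r2 = cache.run_or_load({'x': 7}, slow_square, 7)  # HIT: loads from disk
print(r1, r2, len(calls))  # 49 49 1
```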
A Python decorator adds caching to any function with one line, without touching the function body. This is the cleanest production pattern for a scientific notebook or pipeline.
```python
import functools
import inspect

_default_cache = CacheManager()  # shared module-level cache instance

def cacheable(version=1, cache=_default_cache):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Introspect: get all argument names and their values
            sig = inspect.signature(fn)
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()  # include default args too
            params = dict(bound.arguments)  # {'snr': 300, 'n_scenes': 1000, 'seed': 42}
            params['__fn__'] = fn.__name__  # avoid cross-function collisions
            params['__version__'] = version
            return cache.run_or_load(params, fn, *args, **kwargs)
        return wrapper
    return decorator
```
```python
@cacheable(version=1)  # <-- this one line adds full caching
def run_monte_carlo(snr, n_scenes, seed=42):
    """Slow Monte Carlo CO2 retrieval validation. Function unchanged."""
    np.random.seed(seed)
    errors = np.random.normal(0, 1.0/snr, size=n_scenes)
    return {'errors': errors,
            'precision': float(np.std(errors)),
            'bias': float(np.mean(errors))}

r = run_monte_carlo(snr=300, n_scenes=1000)  # first call:  2.05s
r = run_monte_carlo(snr=300, n_scenes=1000)  # second call: 0.002s
r = run_monte_carlo(snr=500, n_scenes=1000)  # new snr:     2.00s (new hash)
r = run_monte_carlo(snr=500, n_scenes=1000)  # again:       0.002s
```
How `inspect` works here:

- `inspect.signature(fn)` reads the parameter names at definition time.
- `sig.bind(*args, **kwargs)` maps the actual call values to those names.
- `bound.apply_defaults()` fills in any missing optional arguments.

The result is a dict like `{'snr': 300, 'n_scenes': 1000, 'seed': 42}`, which is exactly what gets hashed. Whether you call the function positionally or with keywords, the same inputs always produce the same cache key.
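That binding step can be checked in isolation. This standalone sketch uses a toy function signature, not the Monte Carlo one:

```python
import inspect

def retrieval(snr, n_scenes, seed=42):
    pass

sig = inspect.signature(retrieval)

b1 = sig.bind(300, 1000)               # positional call
b1.apply_defaults()
b2 = sig.bind(snr=300, n_scenes=1000)  # keyword call
b2.apply_defaults()

print(dict(b1.arguments))  # {'snr': 300, 'n_scenes': 1000, 'seed': 42}
print(dict(b1.arguments) == dict(b2.arguments))  # True: same cache key either way
```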
Sometimes you need to force recomputation even when inputs have not changed — for example after fixing a bug in the algorithm. There are three strategies:
| Situation | Strategy | Code |
|---|---|---|
| Fixed a bug in the computation | Bump version in decorator | `@cacheable(version=2)` |
| One specific result is stale | Selective invalidation | `cache.invalidate({'snr': 300, ...})` |
| Wipe everything and restart | Clear all | `cache.clear_all()` |
| External data updated (HITRAN, NWP) | Add data version to params | `'hitran_ver': '2020'` |
```python
# Before finding the bug -- all results cached at version=1
@cacheable(version=1)
def run_monte_carlo(snr, n_scenes):
    ...

# After fixing a bug in the forward model, bump the version:
@cacheable(version=2)  # all version=1 hashes are now different
def run_monte_carlo(snr, n_scenes):
    ...

# On next call -- old .npz files remain on disk but are never found,
# because the hash now includes version=2 instead of version=1.
# All results are recomputed automatically. Zero manual cleanup needed.
```
Always include a version key in your params. Treat it like a migration number in a database: bump it whenever you change the algorithm, fix a bug, or update any upstream dependency, such as the spectroscopic database version, NWP reanalysis vintage, or instrument calibration version.
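One way to wire that in, sketched with illustrative parameter names (`hitran_ver` here is a placeholder key, not a fixed schema):

```python
import hashlib
import json

def make_cache_key(params):
    return hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:16]

params_old = {'snr': 300, 'n_scenes': 1000,
              'version': 3,          # algorithm / bug-fix migration number
              'hitran_ver': '2016'}  # upstream data vintage
params_new = dict(params_old, hitran_ver='2020')  # spectroscopy updated

# Old caches are silently bypassed because the key changes:
print(make_cache_key(params_old) != make_cache_key(params_new))  # True
```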