# Synthetic example

The synthetic example is the fastest way to see pybench end-to-end: it needs only numpy + scipy and runs in seconds. It comes in two parts:

- **Part 1** uses the pybench CLI to catch two kinds of regression: a *global* one (every checkpoint drifts) and a *local* one (a single checkpoint spikes).
- **Part 2** replays those same cases as Monte-Carlo experiments and shows *why* pybench's statistics beat the simpler tests it could have used.

Both parts are built on one shared loss-curve sampler, so let's start there.

## The loss-curve sampler

`synthetic.sample_loss_curves` draws `(n_seeds, n_steps)` noisy losses from an exponential-decay mean curve, `amp·exp(−step/tau) + floor`, with optional per-seed offsets (`seed_sigma`) that correlate the steps within a curve the way a real training run's steps are:

```python
import numpy as np


def sample_loss_curves(rng, *, n_seeds, n_steps, amp, tau, floor, noise, seed_sigma=0.0):
    steps = np.arange(n_steps)
    mean_curve = amp * np.exp(-steps / tau) + floor
    # Independent noise for every (seed, step) cell around the shared mean curve.
    curves = mean_curve + rng.normal(0.0, noise, size=(n_seeds, n_steps))
    if seed_sigma:
        # One offset per seed, broadcast across steps: this correlates the steps.
        curves = curves + rng.normal(0.0, seed_sigma, size=(n_seeds, 1))
    return curves
```

```{eval-rst}
.. plot:: plots/sampler.py
   :caption: Six curves from ``sample_loss_curves`` (independent steps) around the dashed mean curve. Both parts below reuse this exact sampler.
```

## Part 1: catching regressions with the CLI

`benchmarks/bench_synthetic.py` samples one such curve per seed and reports `min:loss` at fixed checkpoints, exercising the `list[dict]` multi-step format and the `min:` lower-is-better convention. Two environment variables drive the walkthrough: `PYBENCH_SYNTHETIC_REGRESS` injects a regression (`global` lifts every checkpoint, `local` spikes the last), and `PYBENCH_SYNTHETIC_RESAMPLE` remaps each seed to a *different* curve so the re-run's scores no longer match the baseline.
```python
import os

import numpy as np

from synthetic import sample_loss_curves

_CHECKPOINTS = (1, 30, 100)


def bench_synthetic(seed: int, *, n_seeds: int = 30) -> list[dict]:
    if os.environ.get("PYBENCH_SYNTHETIC_RESAMPLE"):
        seed = int(np.random.default_rng(seed).integers(2**32))  # different curve
    rng = np.random.default_rng(seed)
    curve = sample_loss_curves(
        rng, n_seeds=1, n_steps=max(_CHECKPOINTS) + 1,
        amp=1.0, tau=30.0, floor=0.10, noise=0.05,
    )[0]
    losses = {s: float(curve[s]) for s in _CHECKPOINTS}
    regress = os.environ.get("PYBENCH_SYNTHETIC_REGRESS")
    if regress == "global":
        losses = {s: v + 0.05 for s, v in losses.items()}
    elif regress == "local":
        losses[_CHECKPOINTS[-1]] += 0.20
    return [{"step": s, "min:loss": losses[s]} for s in _CHECKPOINTS]
```

By default each seed reproduces its own curve exactly, so a clean re-run matches the baseline to the bit; the cases below add a regression, or resample the seeds, on top of that. The remap is deterministic, so every run is reproducible.

Run the four cases in order.

**First run: save the baseline.** The first run has nothing to compare against; it samples 30 seeds, stores them, and marks the benchmark **NEW**.

```console
$ uv run --package synthetic pybench examples/synthetic/benchmarks/
bench_synthetic .......... NEW 1 metrics × 3 steps (baseline saved)
──────────────────────────────────────────────────────────────
0 failed, 0 passed, 1 new in 0.0s
```

**Case 1: no regression → PASS.** Re-run unchanged. pybench reuses the stored seeds, every paired difference is zero, and nothing is flagged.

```console
$ uv run --package synthetic pybench examples/synthetic/benchmarks/
bench_synthetic .......... PASS 1 metrics × 3 steps 0/3 slots flagged meta-p=1.000
──────────────────────────────────────────────────────────────
0 failed, 1 passed, 0 new in 0.0s
```

**Case 2: resampled seeds, no regression → PASS.** Set `PYBENCH_SYNTHETIC_RESAMPLE=1`: each seed now draws a *different* curve, so the current scores no longer equal the baseline's.
There is still no real regression, so the verdict stays PASS: proof that the verdict is a *statistical* test, not a score-equality check.

```console
$ PYBENCH_SYNTHETIC_RESAMPLE=1 uv run --package synthetic pybench examples/synthetic/benchmarks/
bench_synthetic .......... PASS 1 metrics × 3 steps 0/3 slots flagged meta-p=1.000
──────────────────────────────────────────────────────────────
0 failed, 1 passed, 0 new in 0.0s
```

Unlike Case 1, no two scores match, because the curves were redrawn. Yet none of the paired differences is large enough to flag, so pybench passes the noisy re-run.

**Case 3: global regression → FAIL.** Lift every checkpoint. All three slots regress and the verdict flips.

```console
$ PYBENCH_SYNTHETIC_REGRESS=global uv run --package synthetic pybench examples/synthetic/benchmarks/
bench_synthetic .......... FAIL 1 metrics × 3 steps 3/3 slots flagged meta-p=0.000
──────────────────────────────────────────────────────────────
1 failed, 0 passed, 0 new in 0.0s
```

**Case 4: local regression → FAIL.** Spike only the last checkpoint. A single flagged slot is enough; `-v` shows exactly which one:

```console
$ PYBENCH_SYNTHETIC_REGRESS=local uv run --package synthetic pybench examples/synthetic/benchmarks/ -v
bench_synthetic .......... FAIL 1 metrics × 3 steps 1/3 slots flagged meta-p=0.000
  metric    step  baseline   current    Δ        p
  min:loss  1     1.08±0.05  1.08±0.05  +0.0%    1.000
  min:loss  30    0.46±0.04  0.46±0.04  +0.0%    1.000
  min:loss  100   0.13±0.06  0.33±0.06  -155.3%  0.000 ✗
──────────────────────────────────────────────────────────────
1 failed, 0 passed, 0 new in 0.1s
```

That a *single* regressed checkpoint flips the verdict is the whole point, and the reason for the statistics in Part 2.

## Part 2: why the severity permutation, rigorously

`examples/synthetic/main.py` revisits the no-regression, global, and local regimes as Monte-Carlo experiments and pits pybench's verdict against the simpler tests it could have used.
(The resampled Case 2 above is the single-shot version of the no-regression experiment here; over many replications it false-flags at exactly `alpha`.)

`main.py` does **not** reimplement pybench's statistics: the pybench verdict here calls the very functions the CLI runs, `pybench.stats._severity` and `pybench.stats._sign_flip_meta_p`, which implement the within-seed sign-flip permutation of the continuous severity `T = Σ max(0, t_crit − t_stat)` ([§3](how_it_works.md)). Only the *alternative* tests, which are not part of pybench, are written out there. The comparison covers four tests:

- a **global t-test** pooling every `(seed, step)` difference at once;
- a **per-step t-test + binomial** on how many steps came out significant;
- a **sign-flip permutation on the flagged count** (it respects the dependency between steps but throws away each slot's magnitude);
- the **sign-flip permutation on the severity** (pybench's actual verdict, which keeps the magnitude).

```bash
uv run --package synthetic python examples/synthetic/main.py
```

The same exponential-decay sampler feeds all three cases; only the injected regression and the noise structure differ:

```{eval-rst}
.. plot:: plots/regimes.py
   :caption: Left, correlated curves (the no-regression case), where a per-seed offset shifts a whole curve so the steps move together. Right, one regressed checkpoint (the local case), where a single step spikes while the rest match the baseline.
```

Each test's **false-positive rate** (Case 1) and **power** (Cases 2 and 3), over 200 replications:

```text
Case 1 — no regression, but the steps within a seed correlate
  test                             false-positive rate (target = 0.05)
  global t-test (seed×step)        0.405
  per-step t + binomial            0.170
  sign-flip on count               0.040
  sign-flip on severity (pybench)  0.030

Case 2 — a global regression: every checkpoint's loss rises a little
  test                             detection rate (higher is better)
  global t-test (seed×step)        1.000
  per-step t + binomial            1.000
  sign-flip on count               1.000
  sign-flip on severity (pybench)  1.000

Case 3 — a local regression: one checkpoint spikes, the rest unchanged
  test                             detection rate (higher is better)
  global t-test (seed×step)        0.225
  per-step t + binomial            0.070
  sign-flip on count               0.050
  sign-flip on severity (pybench)  0.980
```

**Case 1: false alarms.** With *no real change*, the global t-test pools correlated slots as independent (its standard error is too small) and the per-step binomial assumes the per-step rejections are independent (correlated steps reject together, over-dispersing the count). As a result they cry wolf about 40% and 17% of the time, respectively. Both permutation tests stay at `alpha` (0.040 and 0.030 against the 0.05 target): a permutation test is exactly calibrated whatever statistic it permutes.

**Case 2: the easy regression.** When every checkpoint drifts together, all four tests catch it. This is the global t-test's home turf, and pybench gives up no power here: its strengths in the other two cases cost it nothing on a broad regression.

**Case 3: missed regressions.** A single checkpoint regresses sharply while the rest are unchanged. The global t-test dilutes that one spike across all 200 steps; the per-step binomial sees one significant step, indistinguishable from its `alpha`-rate false positives; and the **sign-flip on count** throws away the spike's magnitude (*one flag is one flag*), so it misses too (5% detection, no better than chance). Only the **sign-flip on severity**, which keeps the spike's magnitude, catches it (98%).
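The contrast between the two permutation statistics in Case 3 can be reproduced in a few lines. The following is a minimal, independent sketch and not pybench's code (that lives in `pybench.stats._severity` and `pybench.stats._sign_flip_meta_p`). The function names, the sizes, the noise level, and the sign convention (`diffs = baseline − current`, so a regression on a lower-is-better metric is negative and `t_crit` is a lower-tail critical value) are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05


def t_stats(diffs):
    """Per-step one-sample t statistics for (n_seeds, n_steps) paired differences."""
    n_seeds = diffs.shape[0]
    return diffs.mean(axis=0) / (diffs.std(axis=0, ddof=1) / np.sqrt(n_seeds))


def severity(diffs, t_crit):
    # T = sum over steps of max(0, t_crit - t_stat): zero for steps that are not
    # flagged, and larger the further a flagged step falls below t_crit.
    return float(np.maximum(0.0, t_crit - t_stats(diffs)).sum())


def flagged_count(diffs, t_crit):
    # Same threshold, but every flagged step counts as 1 regardless of its size.
    return float((t_stats(diffs) < t_crit).sum())


def sign_flip_p(diffs, statistic, t_crit, rng, n_perm=1000):
    """Within-seed sign-flip p-value: each seed's whole row keeps or flips its sign."""
    observed = statistic(diffs, t_crit)
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(diffs.shape[0], 1))
        hits += statistic(diffs * signs, t_crit) >= observed
    return (1 + hits) / (1 + n_perm)


rng = np.random.default_rng(0)
n_seeds, n_steps = 30, 20                    # hypothetical sizes for this sketch
# Assumed convention: diffs = baseline - current, so a higher loss is negative.
diffs = rng.normal(0.0, 0.07, size=(n_seeds, n_steps))
diffs[:, -1] -= 0.20                         # local regression at the last step only
t_crit = stats.t.ppf(ALPHA, df=n_seeds - 1)  # lower-tail critical value (negative)

p_sev = sign_flip_p(diffs, severity, t_crit, rng)
p_count = sign_flip_p(diffs, flagged_count, t_crit, rng)
print(f"sign-flip on severity: p = {p_sev:.4f}")
print(f"sign-flip on count:    p = {p_count:.4f}")
```

In this sketch the severity p-value should land near its floor of `1 / (1 + n_perm)`, because no sign pattern reproduces a spike that large, while the count p-value is typically far bigger: a count of one or two flagged steps is easy to reach by chance across 20 steps. The exact numbers depend on the assumed sizes and will not match the table above.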
The lesson is twofold. Assuming the steps are independent (global t-test, per-step binomial) raises false alarms on correlated noise. And *discarding effect magnitude* (the count permutation) misses a regression hiding in a single slot. pybench's permutation of the **severity** statistic does neither, which is exactly why §3.2 chose it.

For the full walk-through of the machinery, see [How it works](how_it_works.md).
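The first half of that lesson, that pooling correlated steps as if they were independent inflates false alarms, can also be checked in isolation. This is a rough sketch under assumed parameters (the sizes, noise levels, and two-sided tests are hypothetical, and it does not use `examples/synthetic/main.py`). The per-seed-mean t-test is included only as a simple reference that respects the correlation; it is not one of the four tests compared above.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05
N_SEEDS, N_STEPS, N_REPS = 30, 20, 200  # hypothetical sizes for this sketch


def null_diffs(rng, noise=0.05, seed_sigma=0.05):
    """Paired differences with no real change: step noise plus a per-seed offset."""
    step_noise = rng.normal(0.0, noise, size=(N_SEEDS, N_STEPS))
    seed_offset = rng.normal(0.0, seed_sigma, size=(N_SEEDS, 1))
    return step_noise + seed_offset


rng = np.random.default_rng(0)
pooled_hits = 0
per_seed_hits = 0
for _ in range(N_REPS):
    diffs = null_diffs(rng)
    # Pooled test: treats all (seed, step) cells as independent samples.
    pooled_hits += int(stats.ttest_1samp(diffs.ravel(), 0.0).pvalue < ALPHA)
    # Reference test: one value per seed, so within-seed correlation is harmless.
    per_seed_hits += int(stats.ttest_1samp(diffs.mean(axis=1), 0.0).pvalue < ALPHA)

pooled_fpr = pooled_hits / N_REPS
per_seed_fpr = per_seed_hits / N_REPS
print(f"pooled t-test false-positive rate:   {pooled_fpr:.3f}")
print(f"per-seed t-test false-positive rate: {per_seed_fpr:.3f}")
```

With a per-seed offset as large as the step noise, the pooled test's standard error is several times too small, so its false-positive rate should come out well above `ALPHA`, while the per-seed test should stay close to it. The rates depend on the assumed parameters and are not expected to match the table above.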