Getting started
What is pybench?
pybench is a CLI tool that catches performance regressions in code whose
output is a noisy metric rather than a fixed value: model accuracy, a loss
curve, a solver’s score. Such metrics shift from run to run with the random
seed, so a single before/after comparison can’t tell a real regression from
luck.
pybench removes the noise the right way:
- Discovery. Any function named `bench_*` that takes a `seed` and returns a score is a benchmark. No config file, no registration.
- Paired by construction. The seeds sampled for the first (baseline) run are stored; every later run reuses the same seeds. Comparing identical conditions cancels seed-to-seed variance, so far fewer seeds are needed to detect the same regression.
- An honest verdict. Each `(step, metric)` slot gets a one-sided paired t-test, and the whole benchmark is judged by a within-seed sign-flip permutation test. That test does not assume the slots are independent of one another, so correlated metrics and steps don’t inflate false alarms.
- CI-native. `pybench` exits non-zero when any benchmark regresses, so it drops straight into CI like `pytest`.
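To make the within-seed sign-flip idea concrete, here is a minimal standard-library sketch. It is not pybench's implementation: the function name `sign_flip_pvalue`, the choice of test statistic (the mean of all slot differences), and the permutation count are assumptions made for illustration. The key point is that one random sign is drawn per seed and applied to every slot of that seed, so whatever correlation exists between metrics and steps is carried into the null distribution.

```python
import random


def sign_flip_pvalue(diffs, n_perm=10_000, rng_seed=0):
    """One-sided p-value for "the new run is worse than the baseline".

    diffs: one row per seed; each row holds (new - baseline) for every
    (step, metric) slot, already oriented so that higher is better.
    """
    rng = random.Random(rng_seed)

    def stat(rows):
        # Assumed statistic: mean of the per-seed mean differences.
        # More negative means a larger overall drop.
        return sum(sum(row) / len(row) for row in rows) / len(rows)

    observed = stat(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = []
        for row in diffs:
            # One sign per seed, shared by all of that seed's slots.
            sign = rng.choice((1, -1))
            flipped.append([sign * d for d in row])
        if stat(flipped) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

With ten seeds that all show a drop in both slots, only the "no flips" assignment is as extreme as the observed data, so the p-value comes out small; with differences that scatter evenly around zero it stays large.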
A benchmark returns one of three shapes:
def bench_a(seed): return 0.91  # scalar
def bench_b(seed): return {"accuracy": 0.91, "min:loss": 0.42}  # multiple metrics
def bench_c(seed):  # multi-step curve
    return [
        {"step": 1, "accuracy": 0.5, "min:loss": 1.0},
        {"step": 10, "accuracy": 0.91, "min:loss": 0.42},
    ]
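One way to think about the three shapes is that each can be coerced into the most general one, a list of per-step dicts. The sketch below illustrates that idea only; `normalize_result`, the placeholder metric name `"score"`, and the placeholder step `0` are hypothetical choices, not names taken from pybench.

```python
def normalize_result(result):
    """Coerce a scalar, a metrics dict, or a list of step dicts into a list of step dicts."""
    if isinstance(result, (int, float)):
        # Scalar: a single metric at a single step (names are assumptions).
        return [{"step": 0, "score": float(result)}]
    if isinstance(result, dict):
        # Several metrics at a single step.
        return [{"step": 0, **result}]
    # Already a multi-step curve: copy each row so the caller's data is untouched.
    return [dict(row) for row in result]
```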
Scores follow a higher-is-better convention. For metrics where lower is
better (loss, error), prefix the key with min: and return the raw value —
pybench flips the sign internally so that “a decrease in goodness is a
regression.”
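The sign flip can be pictured as a small key-and-value transformation. This is a sketch of the convention as described above, not pybench's code: the helper name `to_goodness` is hypothetical, and whether the real tool strips the `min:` prefix from the stored name is an assumption.

```python
def to_goodness(key, value):
    """Return (metric_name, value) oriented so that higher is always better."""
    prefix = "min:"
    if key.startswith(prefix):
        # Lower-is-better metric: negate so that an increase in loss
        # shows up as a decrease in goodness.
        return key[len(prefix):], -value
    return key, value
```

Under this convention a loss that rises from 0.42 to 0.50 becomes a goodness that falls from -0.42 to -0.50, which is the direction a regression test looks for.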
Quickstart
Install pybench:
uv add pybench # or: pip install pybench
Write a bench_* function that takes a seed and returns a score:
# benchmarks/bench_model.py
def bench_accuracy(seed: int) -> float:
    return train_and_score(seed)
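The snippet above leaves `train_and_score` to the reader. For trying the workflow without a real model, a toy stand-in might look like the following. The body of `train_and_score` here is purely hypothetical: it fakes a noisy accuracy around 0.90 whose value depends only on the seed, which is the property the paired comparison relies on.

```python
# benchmarks/bench_model.py
import random


def train_and_score(seed: int) -> float:
    """Hypothetical stand-in for a real training run."""
    rng = random.Random(seed)  # all randomness derives from the seed
    return 0.90 + rng.gauss(0.0, 0.01)


def bench_accuracy(seed: int) -> float:
    return train_and_score(seed)
```

Because the score is a pure function of `seed`, rerunning on the stored seeds reproduces the baseline exactly until the code under test changes.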
Then drive it from the CLI:
pybench # 1st time: samples seeds, saves a baseline, marks NEW
pybench # later: reruns on the same seeds, marks PASS / FAIL
pybench update # re-baseline after an intended change
pybench show # print current baseline stats (--history for per-commit)
The three ways to invoke pybench:
| Command | What it does | Writes to disk? |
|---|---|---|
| `pybench` | Discover, run, compare; exit 1 if any benchmark fails | Only the first time (baseline init) |
| `pybench update` | Re-run and overwrite the baseline (resamples fresh seeds) | Yes |
| `pybench show` | Print current baseline stats (`--history` for per-commit) | No |
Per-benchmark settings are keyword-only defaults — no config file:
def bench_training(seed: int, *, n_seeds: int = 50, alpha: float = 0.01,
                   min_effect: float = 0.02, workers: int = 4) -> list[dict]:
    ...
| Parameter | Default | Meaning |
|---|---|---|
| `n_seeds` | | Seeds sampled for the baseline |
| `alpha` | | Significance threshold |
| `min_effect` | | Minimum relative drop to flag (suppress trivia) |
| `workers` | | Parallel seed processes (keep …) |
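Settings expressed as keyword-only defaults can be read straight off the function signature with the standard `inspect` module. The sketch below shows the general mechanism under that assumption; `read_settings` is a hypothetical helper and is not claimed to be how pybench itself collects settings.

```python
import inspect


def read_settings(func):
    """Collect the keyword-only parameters of func that declare a default."""
    settings = {}
    for param in inspect.signature(func).parameters.values():
        is_kw_only = param.kind is inspect.Parameter.KEYWORD_ONLY
        has_default = param.default is not inspect.Parameter.empty
        if is_kw_only and has_default:
            settings[param.name] = param.default
    return settings


def bench_training(seed: int, *, n_seeds: int = 50, alpha: float = 0.01,
                   min_effect: float = 0.02, workers: int = 4) -> list[dict]:
    ...


print(read_settings(bench_training))
# {'n_seeds': 50, 'alpha': 0.01, 'min_effect': 0.02, 'workers': 4}
```

The positional `seed` argument is skipped because it is not keyword-only, so only the settings after the bare `*` are picked up.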
The baseline lives at .pybench/baselines.jsonl (one line per benchmark).
Commit it to git — do not gitignore it. History is delegated to git, and
pybench show --history reconstructs the baseline at every commit that touched
it.
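Since the baseline is a JSON Lines file, it can be inspected with a few lines of standard-library Python. The sketch below assumes only that each non-empty line is one JSON value; the fields inside each record are not described here, so the hypothetical helper `load_baselines` returns the records untouched rather than guessing at their schema.

```python
import json
from pathlib import Path


def load_baselines(path=".pybench/baselines.jsonl"):
    """Parse one JSON record per non-empty line of the baseline file."""
    records = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():  # tolerate blank lines
            records.append(json.loads(line))
    return records
```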