Getting started

What is pybench?

pybench is a CLI tool that catches performance regressions in code whose output is a noisy metric rather than a fixed value: model accuracy, a loss curve, a solver’s score. These metrics shift from run to run with the random seed, so a single before/after comparison can’t tell a real regression from luck.

pybench handles that noise as follows:

Discovery. Any function named bench_* that takes a seed and returns a score is a benchmark. No config file, no registration.
Paired by construction. The seeds sampled for the first (baseline) run are stored; every later run reuses the same seeds. Comparing identical conditions cancels seed-to-seed variance, so far fewer seeds are needed to detect the same regression.
An honest verdict. Each (step, metric) slot gets a one-sided paired t-test, and the whole benchmark is judged by a within-seed sign-flip permutation test. The permutation test does not assume that slots are independent of one another, so correlated metrics and steps don’t inflate false alarms.
CI-native. pybench exits non-zero when any benchmark regresses, so it drops straight into CI like pytest.
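
The pairing and sign-flip ideas above can be illustrated with a small standard-library sketch. It is not pybench's implementation: the function names, the simulated scores, and the choice of test statistic (the smallest per-slot mean difference) are assumptions made for illustration. It shows that per-seed differences vary far less than raw scores when both runs share seeds, and that flipping the sign of all of a seed's slots together preserves whatever correlation exists between slots.

```python
import math
import random
import statistics


def paired_t_statistic(baseline, current):
    """t statistic of the per-seed differences (current - baseline)."""
    diffs = [c - b for b, c in zip(baseline, current)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def sign_flip_p_value(diffs_by_seed, n_perm=2000, rng_seed=0):
    """One-sided p-value for "current is worse" via within-seed sign flips.

    diffs_by_seed[i] lists the (current - baseline) difference of every
    slot for seed i. One random sign is drawn per seed and applied to all
    of that seed's slots, so slots are never treated as independent.
    """
    rng = random.Random(rng_seed)
    n_slots = len(diffs_by_seed[0])

    def worst_slot_mean(rows):
        # Illustrative statistic: the most negative mean difference.
        return min(sum(row[j] for row in rows) / len(rows)
                   for j in range(n_slots))

    observed = worst_slot_mean(diffs_by_seed)
    hits = 0
    for _ in range(n_perm):
        flipped = []
        for row in diffs_by_seed:
            sign = rng.choice((-1.0, 1.0))
            flipped.append([sign * d for d in row])
        if worst_slot_mean(flipped) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Simulated scores: each seed contributes shared "luck" to both runs.
rng = random.Random(42)
seed_effect = [rng.gauss(0.0, 0.05) for _ in range(30)]
baseline = [0.90 + e + rng.gauss(0.0, 0.005) for e in seed_effect]
current = [0.88 + e + rng.gauss(0.0, 0.005) for e in seed_effect]  # true drop: 0.02

diffs = [c - b for b, c in zip(baseline, current)]
p_value = sign_flip_p_value([[d] for d in diffs])

print("spread of raw scores:     ", statistics.stdev(baseline))
print("spread of paired diffs:   ", statistics.stdev(diffs))
print("paired t statistic:       ", paired_t_statistic(baseline, current))
print("sign-flip p-value:        ", p_value)
```

Because the shared seed effect appears in both runs, it drops out of the differences, which is why a 0.02 drop is detectable here even though raw scores vary by roughly 0.05 from seed to seed.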

A benchmark returns one of three shapes:

def bench_a(seed): return 0.91                                   # scalar
def bench_b(seed): return {"accuracy": 0.91, "min:loss": 0.42}  # multiple metrics
def bench_c(seed):                                              # multi-step curve
    return [{"step": 1, "accuracy": 0.5, "min:loss": 1.0}, {"step": 10, "accuracy": 0.91, "min:loss": 0.42}]

Scores follow a higher-is-better convention. For metrics where lower is better (loss, error), prefix the key with min: and return the raw value — pybench flips the sign internally so that “a decrease in goodness is a regression.”
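
As a rough sketch of how the three return shapes and the min: prefix could be reduced to one higher-is-better representation, the snippet below flattens any result into (step, metric) slots. The function name normalize, the placeholder metric name "score" for scalars, and the use of None as the step for single-step results are all hypothetical; pybench's internal representation may differ.

```python
def normalize(result):
    """Flatten a benchmark result into {(step, metric): goodness}.

    Hypothetical conventions: a scalar is stored under the metric name
    "score", and results without a "step" key use step None.
    """
    if isinstance(result, (int, float)):
        rows = [{"score": result}]          # scalar shape
    elif isinstance(result, dict):
        rows = [result]                     # multiple-metrics shape
    else:
        rows = list(result)                 # multi-step curve shape

    slots = {}
    for row in rows:
        step = row.get("step")
        for key, value in row.items():
            if key == "step":
                continue
            # Lower-is-better metrics are negated so that a decrease in
            # the stored value always means "worse".
            sign = -1.0 if key.startswith("min:") else 1.0
            slots[(step, key)] = sign * float(value)
    return slots


print(normalize(0.91))
print(normalize({"accuracy": 0.91, "min:loss": 0.42}))
print(normalize([{"step": 1, "accuracy": 0.5, "min:loss": 1.0},
                 {"step": 10, "accuracy": 0.91, "min:loss": 0.42}]))
```

With every slot on the same scale, one comparison rule (a drop is a regression) covers accuracy-style and loss-style metrics alike.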

Quickstart

Install pybench:

uv add pybench        # or: pip install pybench

Write a bench_* function that takes a seed and returns a score:

# benchmarks/bench_model.py
def bench_accuracy(seed: int) -> float:
    return train_and_score(seed)  # train_and_score stands in for your own training and evaluation code

Then drive it from the CLI:

pybench                # first run: samples seeds, saves a baseline, marks NEW
pybench                # later runs: reruns on the same seeds, marks PASS / FAIL
pybench update         # re-baseline after an intended change
pybench show           # print current baseline stats (--history for per-commit)

The three ways to invoke pybench:

Command	What it does	Writes to disk?
`pybench`	Discover, run, compare; exit 1 if any benchmark fails	Only the first time (baseline init)
`pybench update`	Re-run and overwrite the baseline (resamples fresh seeds)	Yes
`pybench show`	Print current baseline stats (`--history` for per-commit)	No

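Per-benchmark settings are declared as keyword-only parameters with default values on the benchmark function itself, so no config file is needed. The example below overrides all four; any parameter you omit falls back to the default listed in the table: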

def bench_training(seed: int, *, n_seeds: int = 50, alpha: float = 0.01,
                   min_effect: float = 0.02, workers: int = 4) -> list[dict]:
    ...

Parameter	Default	Meaning
`n_seeds`	`30`	Seeds sampled for the baseline
`alpha`	`0.05`	Significance threshold
`min_effect`	`None`	Minimum relative drop to flag (suppress trivia)
`workers`	`1`	Parallel seed processes (keep `1` for GPU/serial)
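
To make the keyword-only convention concrete, here is a minimal sketch of how a runner could read such settings with the standard inspect module. The helper read_settings and the DEFAULTS dictionary are hypothetical names; only the parameter names and default values come from the table above, and pybench itself may read them differently.

```python
import inspect

# Fallback values, copied from the settings table.
DEFAULTS = {"n_seeds": 30, "alpha": 0.05, "min_effect": None, "workers": 1}


def read_settings(func):
    """Merge a benchmark's keyword-only defaults over the fallbacks."""
    settings = dict(DEFAULTS)
    for name, param in inspect.signature(func).parameters.items():
        is_keyword_only = param.kind is inspect.Parameter.KEYWORD_ONLY
        has_default = param.default is not inspect.Parameter.empty
        if is_keyword_only and has_default and name in DEFAULTS:
            settings[name] = param.default
    return settings


def bench_training(seed: int, *, n_seeds: int = 50, alpha: float = 0.01,
                   min_effect: float = 0.02, workers: int = 4):
    return [{"step": 1, "accuracy": 0.5}]


def bench_plain(seed: int) -> float:
    return 0.91


print(read_settings(bench_training))
print(read_settings(bench_plain))
```

Keeping the settings in the signature means they travel with the benchmark's source and show up in code review next to the code they affect.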

The baseline lives at .pybench/baselines.jsonl (one line per benchmark). Commit it to git — do not gitignore it. History is delegated to git, and pybench show --history reconstructs the baseline at every commit that touched it.