API reference

Auto-generated from the package’s Google-style docstrings (via autodoc + napoleon). One section per module, in the order they compose a run.

pybench.discovery

Discover bench_* functions by importing Python files under a path.

exception pybench.discovery.DiscoveryError

Raised when discovery cannot satisfy the request.

class pybench.discovery.Benchmark(name, fn, config, file)

A discovered benchmark function and its resolved configuration.

Parameters:
  • name (str)

  • fn (Callable[[int], object])

  • config (BenchConfig)

  • file (Path)

pybench.discovery.import_file(file)

Import a Python file as an anonymous module.

Parameters:

file (Path) – Path to the .py file.

Returns:

The imported module object.

Raises:

DiscoveryError – If the file cannot be loaded as a module.

Return type:

ModuleType
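
As an illustration of what "anonymous module" can mean here, the following is a minimal sketch built on the standard importlib machinery. The name import_file_sketch and the local DiscoveryError class are stand-ins for this example, and the choice to leave the module out of sys.modules is an assumption, not a statement about the package's implementation.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


class DiscoveryError(Exception):
    """Local stand-in for pybench.discovery.DiscoveryError."""


def import_file_sketch(file: Path) -> ModuleType:
    """Load `file` as a module without registering it in sys.modules."""
    # Build an import spec straight from the path. For a suffix Python
    # cannot import (for example .txt), no loader is found and spec is None.
    spec = importlib.util.spec_from_file_location(file.stem, file)
    if spec is None or spec.loader is None:
        raise DiscoveryError(f"cannot load {file} as a module")
    module = importlib.util.module_from_spec(spec)
    try:
        # Runs the file's top-level code inside the fresh module object.
        spec.loader.exec_module(module)
    except Exception as exc:  # missing file, syntax error, failing import, ...
        raise DiscoveryError(f"cannot import {file}: {exc}") from exc
    return module
```

Because the module is never added to sys.modules in this sketch, two benchmark files with the same stem in different directories would not collide.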

pybench.discovery.discover(path, names=None)

Find bench_* functions defined under path.

Benchmark files are .py files: path itself, or every .py file found recursively when path is a directory. Only functions defined in the imported file itself are collected, so a bench_* function imported from elsewhere is ignored.

Parameters:
  • path (Path) – A benchmark file or a directory to walk.

  • names (Sequence[str] | None) – If given, keep only these benchmark names (--bench).

Returns:

Benchmarks sorted by name.

Raises:

DiscoveryError – If path does not exist, a benchmark name is defined twice, or a requested names entry is not found.

Return type:

list[Benchmark]
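
The "defined in the imported file" rule can be sketched with a check on each function's __module__ attribute. The helper name collect_benchmarks is hypothetical, and the package may apply the rule differently; this only shows one way the filter could work.

```python
import inspect
from types import ModuleType


def collect_benchmarks(module: ModuleType) -> dict:
    """Return {name: function} for bench_* functions defined in `module`."""
    found = {}
    for name, obj in vars(module).items():
        if not name.startswith("bench_") or not inspect.isfunction(obj):
            continue
        # A function imported from another module keeps that module's name
        # in __module__, so this comparison drops re-exported benchmarks.
        if obj.__module__ != module.__name__:
            continue
        found[name] = obj
    # Sorted by name, mirroring the ordering documented for discover().
    return dict(sorted(found.items()))
```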

pybench.normalizer

Coerce benchmark return values into a canonical scores mapping.

pybench.normalizer.Scores

Canonical normalized form: {step: {metric: value}}.

alias of dict[int, dict[str, float]]

exception pybench.normalizer.NormalizationError

Raised when a benchmark return value has an unsupported shape.

pybench.normalizer.normalize(result)

Coerce any accepted benchmark return value to canonical Scores.

Parameters:

result (object) – A float, dict of metrics, or list of step dicts.

Returns:

{step: {metric: value}}; scalars and bare dicts use step 0, and a bare scalar is stored under the metric name score.

Raises:

NormalizationError – If the value has an unsupported shape or type.

Return type:

dict[int, dict[str, float]]
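
The three accepted shapes can be sketched as follows. This is an illustration under assumptions, not the package's code: a list entry's step number is taken to be its position in the list, bool values are rejected, and normalize_sketch and the local NormalizationError are stand-in names.

```python
from numbers import Real


class NormalizationError(Exception):
    """Local stand-in for pybench.normalizer.NormalizationError."""


def _metrics(d: dict) -> dict:
    """Validate one {metric: value} dict and coerce its values to float."""
    out = {}
    for key, value in d.items():
        bad_value = isinstance(value, bool) or not isinstance(value, Real)
        if not isinstance(key, str) or bad_value:
            raise NormalizationError(f"bad metric entry: {key!r}={value!r}")
        out[key] = float(value)
    return out


def normalize_sketch(result: object) -> dict:
    """Coerce a scalar, a metric dict, or a list of step dicts to Scores."""
    if isinstance(result, bool):  # assumption: True/False is not a score
        raise NormalizationError("bool is not a supported score")
    if isinstance(result, Real):
        # A bare scalar lands at step 0 under the metric name "score".
        return {0: {"score": float(result)}}
    if isinstance(result, dict):
        # A bare dict of metrics lands at step 0.
        return {0: _metrics(result)}
    if isinstance(result, list) and all(isinstance(s, dict) for s in result):
        # Assumption: the step number is the entry's index in the list.
        return {step: _metrics(s) for step, s in enumerate(result)}
    raise NormalizationError(f"unsupported type: {type(result).__name__}")
```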

pybench.validator

Alignment checks between a run’s scores and the stored baseline.

exception pybench.validator.StepKeyMismatchError(bench, current, baseline)

Raised when the run and baseline step-key sets differ.

Parameters:
  • bench (str)

  • current (set[int])

  • baseline (set[int])

exception pybench.validator.MetricKeyMismatchError(bench, step, current, baseline)

Raised when the run and baseline metric-key sets differ for a step.

Parameters:
  • bench (str)

  • step (int)

  • current (set[str])

  • baseline (set[str])

pybench.validator.validate_alignment(bench, current, baseline)

Assert the run and baseline share identical step and metric keys.

Parameters:
  • bench (str) – Benchmark name, for error messages.

  • current (Mapping[int, Mapping[str, object]]) – Freshly normalized run scores (keyed by step then metric).

  • baseline (Mapping[int, Mapping[str, object]]) – Stored baseline scores (keyed by step then metric).

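Raises:
  • StepKeyMismatchError – If the run and baseline step-key sets differ.

  • MetricKeyMismatchError – If the run and baseline metric-key sets differ for a step.

Return type: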

None
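
The alignment check amounts to two levels of set comparison. A minimal sketch follows, with local stand-in exception classes and the assumption that step keys are compared before metric keys; the function name validate_alignment_sketch is hypothetical.

```python
class StepKeyMismatchError(Exception):
    """Local stand-in mirroring the documented constructor arguments."""

    def __init__(self, bench, current, baseline):
        super().__init__(
            f"{bench}: step keys differ "
            f"(run={sorted(current)}, baseline={sorted(baseline)})"
        )
        self.bench, self.current, self.baseline = bench, current, baseline


class MetricKeyMismatchError(Exception):
    """Local stand-in mirroring the documented constructor arguments."""

    def __init__(self, bench, step, current, baseline):
        super().__init__(
            f"{bench}: metric keys differ at step {step} "
            f"(run={sorted(current)}, baseline={sorted(baseline)})"
        )
        self.bench, self.step = bench, step
        self.current, self.baseline = current, baseline


def validate_alignment_sketch(bench, current, baseline) -> None:
    """Raise unless both mappings share the same step and metric keys."""
    cur_steps, base_steps = set(current), set(baseline)
    if cur_steps != base_steps:
        raise StepKeyMismatchError(bench, cur_steps, base_steps)
    # Step sets are equal here, so every step exists in both mappings.
    for step in sorted(cur_steps):
        cur_metrics, base_metrics = set(current[step]), set(baseline[step])
        if cur_metrics != base_metrics:
            raise MetricKeyMismatchError(bench, step, cur_metrics, base_metrics)
```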

pybench.runner

Run a benchmark over a set of seeds and collect per-seed scores.

pybench.runner.SeedScores

Per-seed raw scores: {step: {metric: [value for each seed]}}.

alias of dict[int, dict[str, list[float]]]

exception pybench.runner.RunShapeError

Raised when a benchmark returns different keys across seeds.

pybench.runner.sample_seeds(n, rng)

Sample n distinct-enough random integer seeds.

Parameters:
  • n (int) – Number of seeds to draw.

  • rng (Generator) – Random generator.

Returns:

A list of n Python ints in [0, 2**32).

Return type:

list[int]

pybench.runner.run_benchmark(bench, seeds, *, on_seed=None)

Run bench on each seed and collect aligned per-seed scores.

Runs serially when the benchmark’s configured worker count is 1 (bench.config.workers == 1); otherwise fans the seeds out across a process pool (each worker re-imports the benchmark file by path).

Parameters:
  • bench (Benchmark) – The benchmark to run.

  • seeds (list[int]) – Seeds to run, in order; output lists align position-by-position.

  • on_seed (Callable[[], None] | None) – Optional callback invoked once per completed seed (progress).

Returns:

Per-seed scores {step: {metric: [value for each seed]}}.

Raises:

RunShapeError – If a later seed yields different step/metric keys than the first.

Return type:

dict[int, dict[str, list[float]]]
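
The serial path and the shape check can be sketched as below. This is a simplified illustration: it takes a plain function that already returns normalized {step: {metric: value}} scores rather than a Benchmark, it omits the process-pool path entirely, and run_serial_sketch and the local RunShapeError are stand-in names.

```python
class RunShapeError(Exception):
    """Local stand-in for pybench.runner.RunShapeError."""


def run_serial_sketch(fn, seeds, on_seed=None) -> dict:
    """Call fn(seed) for each seed and collect values per (step, metric)."""
    collected = {}
    shape = None  # {step: set of metric names}, fixed by the first seed
    for seed in seeds:
        scores = fn(seed)  # assumed to be normalized already
        this_shape = {step: set(metrics) for step, metrics in scores.items()}
        if shape is None:
            shape = this_shape
            collected = {
                step: {metric: [] for metric in metrics}
                for step, metrics in scores.items()
            }
        elif this_shape != shape:
            raise RunShapeError(f"seed {seed} returned different keys")
        # Appending in seed order keeps every list aligned with `seeds`.
        for step, metrics in scores.items():
            for metric, value in metrics.items():
                collected[step][metric].append(value)
        if on_seed is not None:
            on_seed()  # progress hook, once per completed seed
    return collected
```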

pybench.stats

Statistical comparison: paired t-test slots + sign-flip permutation meta-test.

For each (step, metric) slot a one-sided paired t-test (in goodness space, i.e. after the min: sign flip) decides whether the current run regressed. The benchmark verdict is the within-seed sign-flip permutation p-value of a continuous severity statistic (see SPECIFICATIONS.md §3).
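
To make the sign-flip idea concrete, here is a minimal sketch of an exact within-seed sign-flip p-value. It is not the package's meta-test: the sum of paired differences stands in for the severity statistic defined in SPECIFICATIONS.md §3, and the sketch enumerates every arrangement, which is only practical for small seed counts. The function name sign_flip_p_value is hypothetical.

```python
from itertools import product


def sign_flip_p_value(diffs) -> float:
    """Exact one-sided sign-flip p-value for the sum of paired differences.

    `diffs` holds one value per seed, baseline minus current in goodness
    space, so positive values point toward a regression.
    """
    observed = sum(diffs)
    n_total = 0
    n_extreme = 0
    # Under the null hypothesis each seed's difference is equally likely
    # to carry either sign, so every sign pattern is equally probable.
    for signs in product((1, -1), repeat=len(diffs)):
        n_total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            n_extreme += 1
    return n_extreme / n_total
```

With four seeds that all regress, only the unflipped arrangement reaches the observed sum, so the result is 1/16: the smallest p-value four seeds can produce.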

pybench.stats.SeedScores

Per-seed raw scores: {step: {metric: [value for each seed]}}.

alias of dict[int, dict[str, list[float]]]

class pybench.stats.SlotResult(step, metric, baseline_mean, baseline_std, current_mean, current_std, effect_size, p_value, flagged, denom_at_floor)

Comparison outcome for one (step, metric) slot, in raw units.

Parameters:
  • step (int)

  • metric (str)

  • baseline_mean (float)

  • baseline_std (float)

  • current_mean (float)

  • current_std (float)

  • effect_size (float)

  • p_value (float)

  • flagged (bool)

  • denom_at_floor (bool)

denom_at_floor: bool

True when the baseline mean is so small that effect_size is unreliable.

class pybench.stats.Comparison(slots, n_flagged, n_slots, meta_p, passed)

Full benchmark comparison result.

Parameters:
  • slots (list[SlotResult])

  • n_flagged (int)

  • n_slots (int)

  • meta_p (float)

  • passed (bool)

pybench.stats.check_alpha_detectable(n_seeds, alpha)

Reject an alpha that no regression could ever satisfy.

The within-seed sign-flip meta-test has only 2**n_seeds arrangements, so the smallest achievable meta_p is 1 / 2**n_seeds. When alpha <= 1 / 2**n_seeds, the failure condition meta_p < alpha can never hold, and even a maximally severe regression would yield a PASS. This function raises in that case instead of letting the run report a vacuous PASS.

Parameters:
  • n_seeds (int)

  • alpha (float)

Raises:

ValueError – If alpha is unreachable at this seed count.

Return type:

None
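
The check described above reduces to a single comparison against the floor 1 / 2**n_seeds. A minimal sketch, with a hypothetical function name and error message:

```python
def check_alpha_detectable_sketch(n_seeds: int, alpha: float) -> None:
    """Raise ValueError when no meta_p could ever fall below alpha."""
    # Smallest p-value the sign-flip test can produce with n_seeds seeds.
    floor = 1 / 2**n_seeds
    if alpha <= floor:
        raise ValueError(
            f"alpha={alpha} is unreachable with {n_seeds} seeds "
            f"(smallest possible meta_p is {floor})"
        )
```

For example, alpha=0.05 is unreachable with 4 seeds (floor 0.0625) but reachable with 5 (floor 0.03125).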

pybench.stats.compare(baseline, current, *, alpha=0.05, min_effect=None, n_perm=4096, rng=None)

Compare a paired current run against a baseline.

Parameters:
  • baseline (dict[int, dict[str, list[float]]]) – Stored per-seed baseline scores.

  • current (dict[int, dict[str, list[float]]]) – Per-seed current scores, on the same seeds (paired).

  • alpha (float) – Per-slot and overall significance threshold.

  • min_effect (float | None) – Optional minimum relative goodness drop to flag a slot.

  • n_perm (int) – Number of sign-flip permutations for the meta-test.

  • rng (Generator | None) – Random generator; a fresh default one is used when None.

Returns:

A Comparison with per-slot detail and the overall verdict.

Raises:

ValueError – If baseline and current have mismatched seed counts, or if alpha is unreachable at this seed count (see check_alpha_detectable()).

Return type:

Comparison

pybench.store

Read and write the JSONL baseline store.

pybench.store.SeedScores

Per-seed raw scores: {step: {metric: [value for each seed]}}.

alias of dict[int, dict[str, list[float]]]

class pybench.store.BaselineRecord(bench, timestamp, git_commit, git_dirty, seeds, scores)

One benchmark’s stored baseline.

Parameters:
  • bench (str)

  • timestamp (str)

  • git_commit (str | None)

  • git_dirty (bool | None)

  • seeds (list[int])

  • scores (dict[int, dict[str, list[float]]])

pybench.store.parse_baselines(text)

Parse JSONL baseline content into records keyed by benchmark name.

Parameters:

text (str) – Raw JSONL content (e.g. a file’s text or git show output).

Returns:

Mapping of benchmark name to record; blank lines are skipped.

Return type:

dict[str, BaselineRecord]
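
A minimal sketch of JSONL parsing along these lines follows. It assumes that the JSON field names on disk match the BaselineRecord fields listed above and that step keys, which JSON can only store as strings, are converted back to int on load. The dataclass here is a local stand-in and parse_baselines_sketch is a hypothetical name.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaselineRecord:
    """Local stand-in with the documented fields."""

    bench: str
    timestamp: str
    git_commit: Optional[str]
    git_dirty: Optional[bool]
    seeds: list
    scores: dict


def parse_baselines_sketch(text: str) -> dict:
    """Parse one JSON object per line into {bench: BaselineRecord}."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are skipped
        raw = json.loads(line)
        # JSON object keys are always strings; restore integer step keys.
        raw["scores"] = {
            int(step): metrics for step, metrics in raw["scores"].items()
        }
        record = BaselineRecord(**raw)
        records[record.bench] = record
    return records
```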

pybench.store.read_baselines(path)

Load all baseline records keyed by benchmark name.

Parameters:

path (Path) – Path to the JSONL store.

Returns:

Mapping of benchmark name to record; empty if the file is absent.

Return type:

dict[str, BaselineRecord]

pybench.store.write_baselines(path, records)

Rewrite the JSONL store with the given records, one line each.

Parameters:
  • path (Path) – Path to the JSONL store; parent directories are created.

  • records (Iterable[BaselineRecord]) – Records to write (full rewrite).

Return type:

None
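
The full-rewrite behavior can be sketched as follows, assuming each record is serialized with its dataclass field names as JSON keys. The BaselineRecord class is a local stand-in and write_baselines_sketch is a hypothetical name.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class BaselineRecord:
    """Local stand-in with the documented fields."""

    bench: str
    timestamp: str
    git_commit: Optional[str]
    git_dirty: Optional[bool]
    seeds: list
    scores: dict


def write_baselines_sketch(path: Path, records: Iterable[BaselineRecord]) -> None:
    """Replace the file at `path` with one JSON line per record."""
    # Create missing parent directories before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps turns the integer step keys into strings, as JSON requires.
    lines = [json.dumps(asdict(record)) for record in records]
    # write_text truncates the file, so previous content is discarded.
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
```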

pybench.git

Capture git provenance (short SHA + dirty flag) with graceful fallback.

class pybench.git.GitInfo(commit, dirty)

Git provenance recorded with a baseline write.

Parameters:
  • commit (str | None)

  • dirty (bool | None)

pybench.git.git_metadata(cwd=None)

Return the short HEAD SHA and dirty flag, or nulls if git is unavailable.

Parameters:

cwd (Path | None) – Directory to inspect; defaults to the current working directory.

Returns:

GitInfo(commit, dirty). Both are None when cwd is not a git repository or git is not installed.

Return type:

GitInfo
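
One way to get this graceful fallback is to treat any failure of the git subprocess as "no information". The sketch below uses the real git commands rev-parse --short HEAD and status --porcelain; whether the package uses these exact commands is an assumption, and GitInfo here is a local NamedTuple stand-in.

```python
import subprocess
from pathlib import Path
from typing import NamedTuple, Optional


class GitInfo(NamedTuple):
    """Local stand-in for pybench.git.GitInfo."""

    commit: Optional[str]
    dirty: Optional[bool]


def _git(args, cwd) -> Optional[str]:
    """Run git and return its stdout, or None on any failure."""
    try:
        done = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing git binary or a missing directory;
        # CalledProcessError covers "not a git repository" and similar.
        return None
    return done.stdout


def git_metadata_sketch(cwd: Optional[Path] = None) -> GitInfo:
    """Return (short SHA, dirty flag), or (None, None) without git."""
    sha = _git(["rev-parse", "--short", "HEAD"], cwd)
    status = _git(["status", "--porcelain"], cwd)
    if sha is None or status is None:
        return GitInfo(None, None)
    # Any line of porcelain output means the working tree has changes.
    return GitInfo(sha.strip(), bool(status.strip()))
```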

pybench.git.file_history(path)

Return commits that touched path, oldest first.

Parameters:

path (Path) – File whose history to inspect (git is run in its directory).

Returns:

[(short_sha, date), ...] chronological, [] if the file has no commits, or None if not a git repo / git is unavailable.

Return type:

list[tuple[str, str]] | None

pybench.git.file_at_commit(commit, path)

Return the content of path as of commit via git show.

Parameters:
  • commit (str) – Commit-ish (e.g. a short SHA).

  • path (Path) – File to read; git is run in its directory.

Returns:

The file’s text at that commit, or None if git is unavailable or the path did not exist there.

Return type:

str | None

pybench.config

Per-benchmark configuration: keyword defaults plus CLI overrides.

class pybench.config.BenchConfig(n_seeds=30, alpha=0.05, min_effect=None, workers=1)

Resolved configuration for one benchmark.

Parameters:
  • n_seeds (int)

  • alpha (float)

  • min_effect (float | None)

  • workers (int)

pybench.config.extract_config(fn)

Read a benchmark’s keyword-only config defaults from its signature.

Parameters:

fn (Callable[[...], object]) – The bench_* function to inspect.

Returns:

A BenchConfig; any of n_seeds, alpha, min_effect, workers not declared on fn keep their package default.

Return type:

BenchConfig
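
Reading keyword-only defaults from a signature can be sketched with inspect.signature. The BenchConfig dataclass below is a local stand-in that copies the defaults from the signature documented above; making it frozen is an assumption, and extract_config_sketch is a hypothetical name.

```python
import inspect
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class BenchConfig:
    """Local stand-in using the documented package defaults."""

    n_seeds: int = 30
    alpha: float = 0.05
    min_effect: Optional[float] = None
    workers: int = 1


def extract_config_sketch(fn) -> BenchConfig:
    """Build a BenchConfig from fn's keyword-only parameter defaults."""
    known = {f.name for f in fields(BenchConfig)}
    declared = {}
    for name, param in inspect.signature(fn).parameters.items():
        is_config = (
            param.kind is inspect.Parameter.KEYWORD_ONLY
            and name in known
            and param.default is not inspect.Parameter.empty
        )
        if is_config:
            declared[name] = param.default
    # Anything not declared on fn keeps the dataclass default.
    return BenchConfig(**declared)
```

Under this sketch, a benchmark declared as def bench_demo(seed, *, n_seeds=10, alpha=0.01) would resolve to n_seeds=10 and alpha=0.01, with min_effect and workers left at their defaults.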

pybench.config.apply_overrides(config, *, alpha=None, min_effect=None)

Return config with non-None CLI overrides applied.

Parameters:
  • config (BenchConfig) – The benchmark’s resolved configuration.

  • alpha (float | None) – CLI --alpha override, or None to keep the benchmark’s.

  • min_effect (float | None) – CLI --min-effect override, or None to keep it.

Returns:

A new BenchConfig with the overrides merged in.

Return type:

BenchConfig
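
Merging only the non-None overrides maps naturally onto dataclasses.replace. A minimal sketch, again with a local stand-in BenchConfig (frozen by assumption) and a hypothetical function name:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class BenchConfig:
    """Local stand-in using the documented package defaults."""

    n_seeds: int = 30
    alpha: float = 0.05
    min_effect: Optional[float] = None
    workers: int = 1


def apply_overrides_sketch(
    config: BenchConfig,
    *,
    alpha: Optional[float] = None,
    min_effect: Optional[float] = None,
) -> BenchConfig:
    """Return a copy of config with the non-None overrides applied."""
    candidates = {"alpha": alpha, "min_effect": min_effect}
    changes = {k: v for k, v in candidates.items() if v is not None}
    # replace() always builds a new instance, leaving `config` untouched.
    return replace(config, **changes)
```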

pybench.reporter

Terminal output for a benchmark run (Rich, colored).

class pybench.reporter.BenchOutcome(name, status, n_steps, n_metrics, comparison)

One benchmark’s result, ready to render.

Parameters:
  • name (str)

  • status (str)

  • n_steps (int)

  • n_metrics (int)

  • comparison (Comparison | None)

pybench.reporter.report(console, outcomes, *, elapsed, verbose)

Render the full run report.

Parameters:
  • console (Console) – Rich console to write to.

  • outcomes (list[BenchOutcome]) – One outcome per benchmark, in display order.

  • elapsed (float) – Wall-clock seconds for the whole run.

  • verbose (bool) – Expand the per-slot table under each failing benchmark.

Return type:

None

pybench.reporter.report_update(console, updated)

Render the summary of a pybench update.

Parameters:
  • console (Console) – Rich console to write to.

  • updated (list[tuple[str, int]]) – (name, n_seeds) for each rewritten benchmark.

Return type:

None

pybench.reporter.report_show(console, records)

Render the current baseline stats for each benchmark.

Parameters:
  • console (Console) – Rich console to write to.

  • records (dict[str, BaselineRecord]) – Baseline records keyed by benchmark name.

Return type:

None

pybench.reporter.report_history(console, history)

Render per-benchmark baseline history across commits.

Parameters:
  • console (Console) – Rich console to write to.

  • history (dict[str, list[tuple[str, str, BaselineRecord]]]) – {bench: [(short_sha, date, record), ...]} chronological.

Return type:

None

pybench.cli

Click-based CLI entry point.