# MNIST example

The MNIST example trains a small Flax NNX MLP and reports a multi-step
accuracy / loss curve — a genuinely noisy metric, which is exactly where the
paired-seed design earns its keep.

```python
def bench_mnist_mlp(seed: int, *, n_seeds: int = 5, workers: int = 1) -> list[dict]:
    """Train an MLP on MNIST; report loss/accuracy at fixed step checkpoints."""
    ...
    lr = 1e-3
    batch_size = 128
    # A toggle to simulate an improvement: a wider hidden layer lifts accuracy.
    hidden = 512 if os.environ.get("PYBENCH_MNIST_WIDE") else 128
    checkpoints = (200, 1000)
    n_classes = 10

    model = MLP(x_train.shape[1], hidden, n_classes, rngs=nnx.Rngs(seed))

    # A toggle to simulate a regression: Adam (the default) trains well; set
    # PYBENCH_MNIST_SGD to fall back to SGD, which barely moves at this rate.
    if os.environ.get("PYBENCH_MNIST_SGD"):
        optimizer = nnx.Optimizer(model, optax.sgd(lr), wrt=nnx.Param)
    else:
        optimizer = nnx.Optimizer(model, optax.adam(lr), wrt=nnx.Param)
    ...
    return [
        {"step": 200,  "min:train_loss": 0.33, "accuracy": 0.92},
        {"step": 1000, "min:train_loss": 0.17, "accuracy": 0.96},
    ]
```

Default parameters are `n_seeds=5` and `workers=1` (a single device holds the
model, so parallel seed processes don't apply).

## A full lifecycle with Git

Pull the example's stack on demand, then establish a baseline with the
well-trained (Adam) model:

```bash
uv sync --package mnist                                       # JAX/Flax/datasets
uv run --package mnist pybench examples/mnist/benchmarks/
#   bench_mnist_mlp   .......... NEW   2 metrics × 2 steps   (baseline saved)
#   ──────────────────────────────────────────────────────────────
#   0 failed, 0 passed, 1 new  in 32s

git add .pybench/baselines.jsonl && git commit -m "baseline: mnist mlp"
```

### A regression is caught

A bad change regresses the model. Here we set `PYBENCH_MNIST_SGD`, so training
falls back to plain SGD. Re-running reuses the **same 5 seeds** — a paired
comparison:

```bash
PYBENCH_MNIST_SGD=1 uv run --package mnist pybench examples/mnist/benchmarks/ -v
#   bench_mnist_mlp   .......... FAIL   2 metrics × 2 steps   4/4 slots flagged
#   meta-p=0.031
#   metric     step   baseline    current      Δ        p
#   accuracy    200   0.92±0.00   0.23±0.04   -74.7%   0.000  ✗
#   min:train_loss   200   0.25±0.04   2.22±0.03   -778.4%   0.000  ✗
#   accuracy   1000   0.96±0.00   0.65±0.03   -32.1%   0.000  ✗
#   min:train_loss  1000   0.12±0.03   1.78±0.04   -1440.9%   0.000  ✗
#   ──────────────────────────────────────────────────────────────
#   1 failed, 0 passed, 0 new  in 32s         # → exit code 1
```

`pybench` exits non-zero, failing CI like a broken `pytest`. Since this
regression is a mistake, you simply fix the code (drop the SGD path; no
rebaseline) and the next run goes green against the unchanged baseline.

### An improvement is accepted

Now a *good* change: a wider hidden layer (`PYBENCH_MNIST_WIDE`) genuinely lifts
accuracy. pybench tests for regressions one-sidedly, so a run that only gets
better passes:

```bash
PYBENCH_MNIST_WIDE=1 uv run --package mnist pybench examples/mnist/benchmarks/
#   bench_mnist_mlp   .......... PASS   2 metrics × 2 steps   0/4 slots flagged
#   meta-p=1.000
#   ──────────────────────────────────────────────────────────────
#   0 failed, 1 passed, 0 new  in 32s         # → exit code 0
```

> **pybench only flags regressions — judging whether a run is genuinely better
> is up to you.**

Assuming the improvement is real, the baseline still holds the *old, lower*
numbers, so a later regression would only be measured against the weaker model.
Lock the gain in as the new bar with `update`, then inspect the trail with
`show --history`:

```bash
PYBENCH_MNIST_WIDE=1 uv run --package mnist pybench update examples/mnist/benchmarks/ --yes
git add .pybench/baselines.jsonl && git commit -m "rebaseline: wider MLP"

uv run --package mnist pybench show --history
#   bench_mnist_mlp
#     888483b  2026-06-25  accuracy@200: 0.92  min:train_loss@200: 0.33  accuracy@1000: 0.96  min:train_loss@1000: 0.17
#     468717e  2026-06-25  accuracy@200: 0.93  min:train_loss@200: 0.23  accuracy@1000: 0.97  min:train_loss@1000: 0.09
```

The accuracy bar ratchets up across the two baselines (0.92 → 0.93 at step 200,
0.96 → 0.97 at step 1000): any future regression is now measured against the
better model.