reliability-eval

Calibration-first LLM evaluation

Not just "did the model get it right?" but "can you trust how confident it was?"

Calibration metrics

Reports Expected Calibration Error (ECE) and Brier score alongside accuracy, computed per run with a per-bin breakdown.
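As a rough illustration of what these two metrics measure, here is a minimal sketch. It is not this package's API: the `Prediction` and `BinStats` shapes, the function names, and the choice of ten equal-width bins are all assumptions made for the example. ECE is taken as the count-weighted mean gap between accuracy and mean confidence per bin, and the Brier score as the mean squared difference between confidence and the 0/1 outcome.

```typescript
// Hypothetical record shape: a stated confidence in [0, 1] plus correctness.
interface Prediction {
  confidence: number;
  correct: boolean;
}

// Hypothetical per-bin summary for the breakdown.
interface BinStats {
  lower: number;
  upper: number;
  count: number;
  meanConfidence: number;
  accuracy: number;
}

// Brier score: mean squared gap between confidence and the 0/1 outcome.
function brierScore(preds: Prediction[]): number {
  if (preds.length === 0) throw new Error("need at least one prediction");
  let sum = 0;
  for (const p of preds) {
    const outcome = p.correct ? 1 : 0;
    sum += (p.confidence - outcome) ** 2;
  }
  return sum / preds.length;
}

// ECE over equal-width bins (assumed binning scheme), with per-bin stats.
function calibrationReport(
  preds: Prediction[],
  numBins = 10,
): { ece: number; bins: BinStats[] } {
  if (preds.length === 0) throw new Error("need at least one prediction");
  const acc = Array.from({ length: numBins }, () => ({ count: 0, confSum: 0, hits: 0 }));
  for (const p of preds) {
    if (!(p.confidence >= 0 && p.confidence <= 1)) {
      throw new Error(`confidence out of range: ${p.confidence}`);
    }
    // A confidence of exactly 1 falls into the last bin.
    const idx = Math.min(numBins - 1, Math.floor(p.confidence * numBins));
    const bin = acc[idx];
    bin.count += 1;
    bin.confSum += p.confidence;
    if (p.correct) bin.hits += 1;
  }
  let ece = 0;
  const bins: BinStats[] = [];
  acc.forEach((b, i) => {
    // Empty bins contribute nothing to ECE; their stats are reported as 0.
    const meanConfidence = b.count > 0 ? b.confSum / b.count : 0;
    const accuracy = b.count > 0 ? b.hits / b.count : 0;
    ece += (b.count / preds.length) * Math.abs(accuracy - meanConfidence);
    bins.push({
      lower: i / numBins,
      upper: (i + 1) / numBins,
      count: b.count,
      meanConfidence,
      accuracy,
    });
  });
  return { ece, bins };
}
```

A model that answers with 0.75 confidence but is right only half the time shows up as a 0.25 gap in its bin, even if overall accuracy looks acceptable.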

Rigorous comparison

Wilcoxon signed-rank test for paired model comparisons; Spearman rank correlation for confidence vs. accuracy. Both are validated against scipy.stats reference values.
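To show that both statistics fit in plain TypeScript, here is a minimal sketch built on a shared average-rank helper. The function names are hypothetical and this is not the package's implementation. The Wilcoxon part computes only the rank sums (no p-value) and drops zero differences, which is an assumption chosen to mirror the default `zero_method="wilcox"` in `scipy.stats.wilcoxon`.

```typescript
// 1-based ranks; tied values share the mean of the ranks they span.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const shared = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) ranks[order[k].i] = shared;
    start = end + 1;
  }
  return ranks;
}

// Pearson correlation; returns NaN if either input has zero variance.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman's rho: Pearson correlation of the average ranks (handles ties).
function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("inputs must have equal length of at least 2");
  }
  return pearson(averageRanks(x), averageRanks(y));
}

// Wilcoxon signed-rank sums for paired samples. Zero differences are dropped
// (assumption), and only the statistic is computed, not a p-value.
function wilcoxonRankSums(
  a: number[],
  b: number[],
): { wPlus: number; wMinus: number; n: number } {
  if (a.length !== b.length) throw new Error("samples must be paired");
  const diffs = a.map((v, i) => v - b[i]).filter((d) => d !== 0);
  const ranks = averageRanks(diffs.map((d) => Math.abs(d)));
  let wPlus = 0;
  let wMinus = 0;
  diffs.forEach((d, i) => {
    if (d > 0) wPlus += ranks[i];
    else wMinus += ranks[i];
  });
  return { wPlus, wMinus, n: diffs.length };
}
```

For example, `spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])` should agree with `scipy.stats.spearmanr` on the same inputs (about 0.8208), which is the kind of cross-check the reference-value validation implies.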

Zero heavy dependencies

Providers use raw fetch (no SDK). Statistics are pure TypeScript. The core package has no runtime dependencies.
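To make the "raw fetch, no SDK" idea concrete, here is a minimal sketch of a provider call. Everything in it is hypothetical: the `/v1/complete` path, the request and response fields, and the function names are invented for illustration and do not describe any real provider or this package's interfaces. The fetch function is passed in rather than read from the global scope, so the example has no dependencies of its own.

```typescript
// Hypothetical request and result shapes.
interface CompletionRequest {
  model: string;
  prompt: string;
}

interface CompletionResult {
  text: string;
  confidence: number;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Minimal structural type covering the parts of fetch this sketch uses.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Pure helper: builds the HTTP request without sending it.
function buildRequest(baseUrl: string, apiKey: string, req: CompletionRequest): HttpRequest {
  return {
    // Strip trailing slashes so the path is joined cleanly. Path is hypothetical.
    url: `${baseUrl.replace(/\/+$/, "")}/v1/complete`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(req),
  };
}

// Pure helper: validates the (hypothetical) response payload at runtime.
function parseResponse(payload: unknown): CompletionResult {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("response is not an object");
  }
  const { text, confidence } = payload as Record<string, unknown>;
  if (typeof text !== "string") throw new Error("missing text");
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1)) {
    throw new Error("confidence must be a number in [0, 1]");
  }
  return { text, confidence };
}

// Sends the request with whichever fetch implementation the caller supplies.
async function complete(
  fetchImpl: FetchLike,
  baseUrl: string,
  apiKey: string,
  req: CompletionRequest,
): Promise<CompletionResult> {
  const { url, ...init } = buildRequest(baseUrl, apiKey, req);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`provider returned HTTP ${res.status}`);
  return parseResponse(await res.json());
}
```

In Node 18+ or a browser, the global `fetch` can be passed as `fetchImpl`; in tests, a stub that returns a canned payload works just as well. Keeping request building and response parsing as pure functions is one way to stay SDK-free while still checking the wire format.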