Expected Calibration Error (ECE) and Brier score alongside accuracy, computed per run with a per-bin breakdown.
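To make the metrics concrete, here is a minimal sketch of ECE and Brier score with a per-bin breakdown. The `Prediction`, `BinStat`, and `calibrationReport` names are hypothetical, and the equal-width binning with 10 bins is an assumption; the package's own API and binning scheme may differ.

```typescript
// Hypothetical shapes for illustration; not the package's actual API.
interface Prediction {
  confidence: number; // stated probability of being correct, in [0, 1]
  correct: boolean;   // whether the answer was actually right
}

interface BinStat {
  lower: number;          // lower edge of the bin
  upper: number;          // upper edge of the bin
  count: number;          // predictions that fell in this bin
  meanConfidence: number; // average confidence inside the bin (0 if empty)
  accuracy: number;       // fraction correct inside the bin (0 if empty)
}

interface CalibrationReport {
  accuracy: number;
  brier: number;
  ece: number;
  bins: BinStat[];
}

function calibrationReport(preds: Prediction[], numBins = 10): CalibrationReport {
  if (preds.length === 0) throw new Error("need at least one prediction");
  const acc = Array.from({ length: numBins }, () => ({ n: 0, conf: 0, hits: 0 }));
  let brierSum = 0;
  let hits = 0;

  for (const p of preds) {
    const y = p.correct ? 1 : 0;
    // Assumed equal-width bins; confidence 1.0 is folded into the last bin.
    const raw = Math.floor(p.confidence * numBins);
    const idx = Math.max(0, Math.min(numBins - 1, raw));
    acc[idx].n += 1;
    acc[idx].conf += p.confidence;
    acc[idx].hits += y;
    const d = p.confidence - y;
    brierSum += d * d; // Brier score: mean squared error of the confidence
    hits += y;
  }

  const total = preds.length;
  let ece = 0;
  const bins: BinStat[] = acc.map((b, i) => {
    const meanConfidence = b.n > 0 ? b.conf / b.n : 0;
    const accuracy = b.n > 0 ? b.hits / b.n : 0;
    // ECE: bin-size-weighted gap between accuracy and mean confidence.
    ece += (b.n / total) * Math.abs(accuracy - meanConfidence);
    return {
      lower: i / numBins,
      upper: (i + 1) / numBins,
      count: b.n,
      meanConfidence,
      accuracy,
    };
  });

  return { accuracy: hits / total, brier: brierSum / total, ece, bins };
}
```

For two predictions at confidence 0.75, one right and one wrong, this yields an accuracy of 0.5, an ECE of 0.25, and a Brier score of 0.3125.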
Rigorous comparison
Wilcoxon signed-rank test for paired model comparisons; Spearman rank correlation for confidence vs. accuracy. Both validated against scipy.stats reference values.
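As a rough illustration of what these two statistics involve, the sketch below computes Spearman's rank correlation and the Wilcoxon signed-rank statistic in plain TypeScript. All function names are hypothetical. It assumes average ranks for ties and drops zero differences before ranking, and it stops at the test statistic: no p-value is computed here.

```typescript
// Hypothetical helpers for illustration; not the package's actual API.

// Assign 1-based ranks, giving tied values the average of their positions.
function averageRanks(values: number[]): number[] {
  const order = values.map((_, i) => i).sort((a, b) => values[a] - values[b]);
  const ranks = new Array<number>(values.length).fill(0);
  let i = 0;
  while (i < order.length) {
    let j = i;
    while (j + 1 < order.length && values[order[j + 1]] === values[order[i]]) j++;
    const avg = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) ranks[order[k]] = avg;
    i = j + 1;
  }
  return ranks;
}

// Pearson correlation; returns NaN if either input has zero variance.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((s, v) => s + v, 0) / n;
  const my = y.reduce((s, v) => s + v, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Spearman's rho is the Pearson correlation of the rank-transformed data.
function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length) throw new Error("inputs must have equal length");
  return pearson(averageRanks(x), averageRanks(y));
}

// Wilcoxon signed-rank statistic for paired samples (statistic only).
function wilcoxonStatistic(a: number[], b: number[]) {
  if (a.length !== b.length) throw new Error("inputs must have equal length");
  // Assumption: pairs with zero difference are discarded before ranking.
  const diffs = a.map((v, i) => v - b[i]).filter((d) => d !== 0);
  const ranks = averageRanks(diffs.map(Math.abs));
  let wPlus = 0;
  let wMinus = 0;
  diffs.forEach((d, i) => {
    if (d > 0) wPlus += ranks[i];
    else wMinus += ranks[i];
  });
  return { wPlus, wMinus, statistic: Math.min(wPlus, wMinus) };
}
```

For example, with paired scores `[5, 7, 9, 4]` and `[3, 8, 5, 4]`, the differences are 2, -1, 4, and 0; after dropping the zero, the positive ranks sum to 5 and the negative ranks to 1, so the statistic is 1.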
Zero heavy dependencies
Providers use raw fetch (no SDK). Statistics are pure TypeScript. The core package has no runtime dependencies.
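A provider built on raw fetch might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the `ProviderConfig` shape, the `/v1/complete` path, the request payload, and the `{ text }` response are hypothetical and do not describe any real provider or this package's actual interface. The fetch function is injected so the provider can be exercised without a network.

```typescript
// Hypothetical endpoint, payload, and response shape; no real provider is implied.
interface ProviderConfig {
  baseUrl: string;
  apiKey: string;
  model: string;
}

// Minimal structural types so the sketch does not depend on DOM typings.
interface RequestInitLike {
  method: string;
  headers: Record<string, string>;
  body: string;
}
interface ResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string, init: RequestInitLike) => Promise<ResponseLike>;

// Pure function: builds the HTTP request without sending it.
function buildRequest(
  cfg: ProviderConfig,
  prompt: string,
): { url: string; init: RequestInitLike } {
  return {
    url: `${cfg.baseUrl}/v1/complete`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${cfg.apiKey}`,
      },
      body: JSON.stringify({ model: cfg.model, prompt }),
    },
  };
}

// Sends the request with the supplied fetch implementation and
// validates the (assumed) response shape before returning the text.
async function complete(
  cfg: ProviderConfig,
  prompt: string,
  fetchImpl: FetchLike,
): Promise<string> {
  const { url, init } = buildRequest(cfg, prompt);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`provider returned HTTP ${res.status}`);
  const data = (await res.json()) as { text?: unknown };
  if (typeof data.text !== "string") throw new Error("unexpected response shape");
  return data.text;
}
```

In Node 18+ or a browser, the global `fetch` can typically be passed as `fetchImpl`; in tests, a stub that returns a canned response keeps the suite offline.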