Calzone: Evaluating the Calibration of Probabilistic Models

This regulatory science tool is a software suite for evaluating the calibration of models with probabilistic output.

Technical Description

Calzone [1] is a Python library for evaluating the calibration of probabilistic models [2]. It implements a comprehensive suite of metrics, visualization methods, and statistical tests to assess how well predicted probabilities agree with observed outcomes. Given a representative dataset containing model predictions and ground-truth labels, Calzone computes calibration metrics together with corresponding confidence intervals.
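
To make this workflow concrete, the sketch below computes one such metric with a percentile-bootstrap confidence interval using plain NumPy. The helper names (expected_calibration_error, bootstrap_ci) and the synthetic data are illustrative assumptions, not Calzone’s actual interface; see the package documentation for the real API.

    import numpy as np

    # Illustrative sketch only; not Calzone's API.
    def expected_calibration_error(y_true, y_prob, n_bins=10):
        """Equal-width-bin ECE: sample-weighted mean |observed frequency - mean probability|."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        which = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)  # keep p == 1.0 in top bin
        ece = 0.0
        for b in range(n_bins):
            mask = which == b
            if mask.any():
                ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
        return ece

    def bootstrap_ci(metric, y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
        """Percentile-bootstrap confidence interval for any metric(y_true, y_prob)."""
        rng = np.random.default_rng(seed)
        n = len(y_true)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)  # resample cases with replacement
            stats.append(metric(y_true[idx], y_prob[idx]))
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

    # Synthetic, perfectly calibrated data: labels drawn at the stated probabilities.
    rng = np.random.default_rng(0)
    y_prob = rng.uniform(size=2000)
    y_true = (rng.uniform(size=2000) < y_prob).astype(int)
    lo, hi = bootstrap_ci(expected_calibration_error, y_true, y_prob)
    print(f"ECE = {expected_calibration_error(y_true, y_prob):.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")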

The package includes widely used calibration metrics such as the Expected Calibration Error (ECE) [3], Maximum Calibration Error (MCE), Hosmer-Lemeshow (HL) statistic [4], Integrated Calibration Index (ICI) [5], Spiegelhalter’s Z-statistic [6], and Cox’s calibration slope/intercept test. Calzone also supports multiclass calibration assessment through reduction strategies such as one-vs-rest and top-class probability formulations. In addition, it provides methods for prevalence adjustment to address miscalibration arising from prevalence shift [7]. For visualization, the library generates reliability diagrams [8], allowing users to inspect calibration behavior across customizable probability bins.
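
The binning behind reliability diagrams and binned metrics can be illustrated directly from the definitions [3], [8]. The following sketch builds a reliability curve with plain NumPy and Matplotlib and reads off the MCE as the largest per-bin gap; it is a minimal illustration of the concept, not Calzone’s plotting interface.

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative sketch only; not Calzone's API.
    def reliability_curve(y_true, y_prob, n_bins=10):
        """Per-bin mean predicted probability vs. observed outcome frequency."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        which = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
        conf, freq = [], []
        for b in range(n_bins):
            mask = which == b
            if mask.any():                      # skip empty bins
                conf.append(y_prob[mask].mean())
                freq.append(y_true[mask].mean())
        return np.array(conf), np.array(freq)

    rng = np.random.default_rng(1)
    y_prob = rng.uniform(size=5000)
    y_true = (rng.uniform(size=5000) < y_prob ** 1.5).astype(int)  # deliberately miscalibrated

    conf, freq = reliability_curve(y_true, y_prob)
    print("MCE =", np.abs(freq - conf).max())   # largest per-bin calibration gap
    plt.plot(conf, freq, "o-", label="model")
    plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Observed frequency")
    plt.legend()
    plt.show()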

Intended Purpose 

The Calzone package is intended to support the evaluation of probability outputs from AI-enabled models used in medical device applications. This may include, for example, computer-aided diagnosis (CADx) devices and risk predictors. It provides well-documented reference implementations of key metrics to quantify how well predicted probabilities correspond to true outcome frequencies.

Testing

Calzone was evaluated using simulated datasets and controlled calibration scenarios [1]. The implemented methods were tested for numerical correctness, computational stability, and consistency across a range of input settings. In addition, the statistical tests included in the package were assessed to confirm appropriate Type I and Type II error behavior under both calibrated and intentionally miscalibrated conditions.
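
A check of this kind can be reproduced in a few lines. The sketch below simulates labels from a perfectly calibrated model and verifies that a two-sided test based on Spiegelhalter’s Z-statistic [6] rejects at roughly the nominal 5% rate. The statistic is implemented here directly from its published definition, as an illustration rather than Calzone’s own implementation.

    import numpy as np
    from scipy.stats import norm

    # Illustrative sketch only; not Calzone's API.
    def spiegelhalter_z(y_true, y_prob):
        """Spiegelhalter's Z statistic; approximately N(0, 1) for a calibrated model."""
        num = np.sum((y_true - y_prob) * (1 - 2 * y_prob))
        den = np.sqrt(np.sum((1 - 2 * y_prob) ** 2 * y_prob * (1 - y_prob)))
        return num / den

    rng = np.random.default_rng(2)
    n_sim, n, alpha = 1000, 500, 0.05
    rejections = 0
    for _ in range(n_sim):
        p = rng.uniform(0.05, 0.95, n)
        y = (rng.uniform(size=n) < p).astype(int)   # labels drawn at the stated probabilities
        p_value = 2 * norm.sf(abs(spiegelhalter_z(y, p)))
        rejections += p_value < alpha
    print(f"Empirical Type I error: {rejections / n_sim:.3f} (nominal {alpha})")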

Limitations

  • (Representative Datasets) Accurate calibration assessment depends on access to a dataset that is representative of the intended-use population. If the dataset contains incorrect labels, unaddressed prevalence shifts, or other forms of dataset shift, the resulting calibration metrics may not reflect true model performance and may therefore be misleading (a standard correction for prevalence shift is sketched after this list).
  • (Dependency Versions) Calzone depends on a small set of widely used scientific computing libraries, including NumPy, SciPy, Matplotlib, and statsmodels. Although these dependencies are stable and broadly supported, not all version combinations have been explicitly tested. Users are encouraged to report compatibility issues through the GitHub repository.
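
For reference, one standard form of prevalence adjustment, in the spirit of [7], rescales the posterior odds by the ratio of target-population to training-population odds of disease. The sketch below is a generic Bayes correction under that assumption; the exact estimation procedure used by Calzone is described in [1] and in the package documentation.

    import numpy as np

    # Illustrative sketch only; not Calzone's API.
    def prevalence_adjust(p, train_prev, target_prev):
        """Map probabilities calibrated at train_prev to a population with target_prev
        by rescaling the posterior odds (Bayes correction for prevalence shift)."""
        eta = (target_prev / (1 - target_prev)) / (train_prev / (1 - train_prev))
        odds = eta * p / (1 - p)
        return odds / (1 + odds)

    p = np.array([0.2, 0.5, 0.8])
    print(prevalence_adjust(p, train_prev=0.5, target_prev=0.1))
    # A probability of 0.5 at 50% prevalence becomes 0.1 at 10% prevalence.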

Supporting Documentation

The Calzone source code is hosted on GitHub. User documentation is available through Read the Docs, and example use cases are included in the documentation.

GitHub repository: https://github.com/DIDSR/calzone
Documentation: https://calzone-docs.readthedocs.io/en/latest/index.html

[1] K. L. Fan et al., “Calzone: A Python package for measuring calibration of probabilistic models for classification,” J. Open Source Softw., vol. 10, no. 114, p. 8026, Oct. 2025.

[2] B. Van Calster, M. Van Smeden, L. Wynants, and E. W. Steyerberg, “Calibration: the Achilles heel of predictive analytics,” BMC Med., vol. 17, no. 1, p. 230, Dec. 2019.

[3] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On Calibration of Modern Neural Networks,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), PMLR, vol. 70, 2017, pp. 1321–1330.

[4] D. W. Hosmer and S. Lemesbow, “Goodness of fit tests for the multiple logistic regression model,” Commun. Stat.-Theory Methods, vol. 9, no. 10, pp. 1043–1069, 1980.

[5] P. C. Austin and E. W. Steyerberg, “The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models,” Stat. Med., vol. 38, no. 21, pp. 4051–4065, Sep. 2019.

[6] D. J. Spiegelhalter, “Probabilistic prediction in patient management and clinical trials,” Stat. Med., vol. 5, no. 5, pp. 421–433, 1986.

[7] W. Chen, B. Sahiner, F. Samuelson, A. Pezeshk, and N. Petrick, “Calibration of medical diagnostic classifier scores to the probability of disease,” Stat. Methods Med. Res., vol. 27, no. 5, pp. 1394–1409, May 2018.

[8] M. H. DeGroot and S. E. Fienberg, “The comparison and evaluation of forecasters,” J. R. Stat. Soc. Ser. D (The Statistician), vol. 32, no. 1–2, pp. 12–22, 1983.
