
SegAgree: Statistical Assessment of Agreement in Overlap-Based Performance Between an AI Segmentation Device and a Multi-Expert Human Panel

Catalog of Regulatory Science Tools to Help Assess New Medical Devices 

This regulatory science tool is a lab method that can be used to assess agreement in overlap-based performance between an AI segmentation device and a multi-expert human panel.

Technical Description

The SegAgree tool implements a statistical segmentation-interchangeability framework that directly compares device-to-expert dissimilarity with expert-to-expert dissimilarity. The tool takes image-level pairwise device-expert and expert-expert Dice similarity coefficient scores as input and outputs the mean Dice difference with its associated 95% confidence interval. It quantifies the agreement of device segmentations with multiple human experts’ segmentations without requiring a reference standard or the predefined cutoffs needed with traditional evaluation metrics. This provides a comprehensive characterization of device–panel interchangeability and supports objective interpretation, particularly when a traditional evaluation approach yields borderline performance whose interpretation can be ambiguous.
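As a rough illustration of the input/output behavior described above, the mean Dice difference and its 95% confidence interval could be computed from pairwise Dice scores along the following lines. This is a sketch, not the tool's actual API: the function name, array layout, and bootstrap interval are assumptions for illustration; the tool's exact statistical procedure is described in Supporting Document 1.

```python
import numpy as np

def mean_dice_difference(device_expert, expert_expert, alpha=0.05,
                         n_boot=2000, seed=0):
    """Illustrative sketch (not SegAgree's actual implementation).

    device_expert: (n_images, n_experts) array of device-vs-expert Dice scores
    expert_expert: (n_images, n_pairs) array of expert-vs-expert Dice scores

    Returns the mean per-image Dice difference (device-expert minus
    expert-expert) and a bootstrap (1 - alpha) confidence interval over images.
    """
    de = np.asarray(device_expert, dtype=float)
    ee = np.asarray(expert_expert, dtype=float)
    # Per-image difference between device-panel and within-panel agreement
    diff = de.mean(axis=1) - ee.mean(axis=1)
    # Nonparametric bootstrap over images for the CI
    rng = np.random.default_rng(seed)
    n = diff.size
    boots = np.array([diff[rng.integers(0, n, n)].mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), (lo, hi)
```

Under this kind of framework, a confidence interval covering zero is consistent with the device segmenting interchangeably with the expert panel, while an interval entirely below zero indicates the device agrees with experts less than experts agree with each other.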

Traditional segmentation evaluation compares AI outputs against a reference standard aggregated from an expert panel using metrics such as Dice, but clinically meaningful cutoffs for these metrics are lacking, making objective performance targets difficult to define and borderline results hard to interpret.
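For reference, the Dice similarity coefficient mentioned above measures the overlap between two binary segmentation masks. A minimal sketch (the empty-mask convention of returning 1.0 is an assumption, not something the source specifies):

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary segmentation masks:
    2 * |A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom
```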

Intended Purpose 

This tool is intended to support performance evaluation of medical imaging AI segmentation devices by quantifying device–panel interchangeability and supporting objective interpretation, especially when a traditional evaluation approach yields borderline performance.

Applicable Medical Devices:

  • Medical imaging-based AI/ML segmentation devices
  • Computer-assisted diagnostic software for lesion segmentation
  • Radiological image processing systems with segmentation capabilities
  • Automated segmentation tools for surgical planning and radiation therapy

Testing

The evaluation testing and usage of this tool are demonstrated in the following article (Supporting Document 1 referenced below), which describes the method's ability to avoid false claims of disagreement when the device and expert panel perform similarly, and to correctly detect true disagreement when performance differs. Testing included statistical simulations across a comprehensive set of scenarios with varying overlap-based performance levels, and image-based simulations using the Medical Image Segmentation Synthesis (MISS) Tool (Supporting Document 2 referenced below) to generate synthetic contours with different transformation parameters.
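The flavor of those two simulation checks can be sketched as follows. This is a toy setup with made-up noise levels, not the paper's simulation design or the MISS tool; the normal-approximation confidence interval is also an assumption for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_panel_dice(n_images, device_shift):
    """Toy simulation of per-image mean expert-expert (ee) and device-expert
    (de) Dice scores; device_shift degrades the device relative to the panel.
    Noise levels (0.03, 0.01) are arbitrary choices for illustration."""
    ee = np.clip(rng.normal(0.85, 0.03, n_images), 0.0, 1.0)
    de = np.clip(ee - device_shift + rng.normal(0.0, 0.01, n_images), 0.0, 1.0)
    return de, ee

def difference_ci(de, ee):
    """Mean Dice difference with a normal-approximation 95% CI over images."""
    diff = de - ee
    m = diff.mean()
    half = 1.96 * diff.std(ddof=1) / np.sqrt(diff.size)
    return m, (m - half, m + half)

# Similar performance: the CI should typically cover zero (no false claim
# of disagreement).
m0, ci0 = difference_ci(*simulate_panel_dice(200, device_shift=0.0))
# Degraded device: the CI should fall entirely below zero (true disagreement
# is detected).
m1, ci1 = difference_ci(*simulate_panel_dice(200, device_shift=0.2))
```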

Limitations

The method currently applies only to medical imaging segmentation, not to other tasks. Its limitations (treating the reader effect as fixed, and being designed to assess differences in overlap-based segmentation performance rather than distance-based or other types of performance) are also discussed in the Discussion section of Supporting Document 1.

Supporting Documentation

  1. Hu T, Sahiner B, Guan S, Mikailov M, Cha K, Samuelson F, Petrick N. Statistical testing of agreement in overlap-based performance between an AI segmentation device and a multi-expert human panel without requiring a reference standard. J Med Imaging (Bellingham). 2025 Sep;12(5):055003. https://doi.org/10.1117/1.JMI.12.5.055003. Epub 2025 Oct 22. PMID: 41132782; PMCID: PMC12543030.
  2. Guan S, Samala R, Arab A, Chen W. "MISS-tool: medical image segmentation synthesis tool to emulate segmentation errors", Proc. SPIE 12465, Medical Imaging 2023: Computer-Aided Diagnosis, 1246518 (7 April 2023). https://doi.org/10.1117/12.2653650
  3. User Manual: https://github.com/DIDSR/seg-agreement

Contact

Tool Reference