
M-SYNTH: A Dataset for the Comparative Evaluation of Mammography AI

Catalog of Regulatory Science Tools to Help Assess New Medical Devices 


This regulatory science tool is a synthetic mammography dataset that includes a variety of breast densities, breast sizes, and inserted lesions imaged with different exposure levels. The dataset is intended to be used for the comparative evaluation of AI tools used in mammography.


Technical Description

M-SYNTH [1] is a dataset of 45,000 synthetic digital mammography (DM) examples. We rely on the VICTRE pipeline [2] (see the VICTRE GitHub page and the FDA Regulatory Science Tools (RST) Catalog for additional information) to generate breast models and their corresponding DM images. In silico breast models [3] (also known as breast imaging phantoms) were generated using a procedural analytic model that allows various patient characteristics to be adjusted, including breast shape, size, and glandular density. Lesions were inserted in a subset of the models to create the signal-present cohort. The resulting breast models were then imaged using a state-of-the-art Monte Carlo x-ray transport code (MC-GPU) [4]. The dataset can be used for comparative testing of algorithms designed for mammography analysis.

We studied four breast densities: extremely dense (referred to as "dense"), heterogeneously dense (referred to as "hetero"), scattered, and fatty. Each breast density was paired with a different breast size to correspond with population statistics, so the dense breast is the smallest, followed by heterogeneously dense, scattered, and fatty. To mimic organ compression during imaging, the breast models were compressed to 3.5 cm (dense), 4.5 cm (hetero), 5.5 cm (scattered), and 6.0 cm (fatty). Random spiculated breast masses were generated in three sizes (5 mm, 7 mm, and 9 mm radii), and mass density was set to a multiple of the glandular tissue density (1.0, 1.06, and 1.1 times). For dense and hetero breasts, only the 5 mm and 7 mm masses were used, since 9 mm masses do not fit within the breast region. No micro-calcification clusters were inserted. To create the signal-present cohort, a single spiculated mass was inserted in half of the cases at a location chosen at random from a list of candidate sites determined by the positions of the terminal duct lobular units. M-SYNTH includes 300 DM examples per cohort type (combination of breast density, mass radius, mass density, and relative dose). Each example consists of an image (in RAW and DICOM formats), an image-level annotation, the mass location, and a pixel-level segmentation of the mass.

The dataset has the following cohort characteristics:

  • Breast density: dense, heterogeneously dense, scattered, fatty
  • Mass radius (mm): 5.00, 7.00, 9.00
  • Mass density: 1.0, 1.06, 1.1 (ratio of radiodensity of the mass to that of fibroglandular tissue)
  • Relative dose: 20%, 40%, 60%, 80%, 100% of the clinically recommended dose for each density
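The cohort grid above can be checked against the stated dataset size. The following is a minimal sketch, not part of the released code: it enumerates the combinations listed above, drops the 9 mm masses for dense and hetero breasts as described in the text, and multiplies by 300 examples per cohort. The constant and function names are illustrative assumptions.

```python
from itertools import product

# Values copied from the cohort list above; names are illustrative only.
BREAST_DENSITIES = ["dense", "hetero", "scattered", "fatty"]
MASS_RADII_MM = [5.0, 7.0, 9.0]
MASS_DENSITIES = [1.0, 1.06, 1.1]
RELATIVE_DOSES_PCT = [20, 40, 60, 80, 100]
EXAMPLES_PER_COHORT = 300

# 9 mm masses are not used in the two densest (smallest) breast types.
MAX_RADIUS_MM = {"dense": 7.0, "hetero": 7.0, "scattered": 9.0, "fatty": 9.0}


def enumerate_cohorts():
    """Return every valid (breast density, radius, mass density, dose) tuple."""
    return [
        (breast_density, radius, mass_density, dose)
        for breast_density, radius, mass_density, dose in product(
            BREAST_DENSITIES, MASS_RADII_MM, MASS_DENSITIES, RELATIVE_DOSES_PCT
        )
        if radius <= MAX_RADIUS_MM[breast_density]
    ]


cohorts = enumerate_cohorts()
total_examples = len(cohorts) * EXAMPLES_PER_COHORT
print(len(cohorts), total_examples)  # prints: 150 45000
```

The 150 cohorts times 300 examples give 45,000, which matches the dataset size stated in the Technical Description.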

Intended Purpose

M-SYNTH is a synthetic mammography dataset that can be used to develop (train or pre-train) or comparatively test AI algorithms for segmentation, detection, and/or classification of breast lesions, and to evaluate the effect of mass size, mass density, breast density, and dose on AI performance in lesion detection. M-SYNTH cannot fully replace real patient data for the evaluation of mammography AI.


Testing

As a qualitative comparison, we first looked at the spread of the first five moments (mean, variance, skewness, kurtosis, and hyperskewness) of the pixel intensity distributions in synthetic and real patient data. The spread of these moments in M-SYNTH covered that of real patient images (from the INbreast dataset [6]) and showed distinctions across the breast density subgroups, as expected. For a task-based assessment of M-SYNTH, we then trained neural networks on M-SYNTH and on real patient data for the task of mass detection (i.e., classifying whether a DM image contains a mass). Performance was quantified using the area under the receiver operating characteristic curve (AUC). The evaluation was treated as a multiple-reader multiple-case (MRMC) study in which each AI model is a single reader. Multiple readers were obtained by retraining the model with different random seeds, and the iMRMC software (see [5] and iMRMC RST) was used to estimate the associated confidence intervals.
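The moment comparison described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes skewness, kurtosis, and hyperskewness are the third, fourth, and fifth standardized moments computed with population (not sample) normalization, and the function name `intensity_moments` and the random test image are hypothetical.

```python
import numpy as np


def intensity_moments(image):
    """Return mean, variance, and the 3rd-5th standardized moments of an image.

    Assumption: population normalization (divide by N); the source does not
    say which estimator was used.
    """
    pixels = np.asarray(image, dtype=np.float64).ravel()
    mean = pixels.mean()
    centered = pixels - mean
    variance = np.mean(centered ** 2)
    std = np.sqrt(variance)
    if std == 0:
        raise ValueError("constant image: standardized moments are undefined")
    z = centered / std
    return {
        "mean": float(mean),
        "variance": float(variance),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4)),
        "hyperskewness": float(np.mean(z ** 5)),
    }


# Hypothetical stand-in for a mammogram: a random 64x64 array.
rng = np.random.default_rng(0)
example_image = rng.normal(loc=0.5, scale=0.1, size=(64, 64))
moments = intensity_moments(example_image)
print(sorted(moments))
```

Computing these five values per image for both datasets, and then comparing their ranges per breast density subgroup, is one way to reproduce the kind of spread comparison described above.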

Evaluating the change in performance across all subgroups (with the AI trained and tested on M-SYNTH data) showed trends consistent with findings from clinical practice. AUC improved with larger mass density and mass size and was affected by breast density: in most cases, mass detection performance was lowest in high-density (dense) breasts and highest in low-density (fatty) breasts. When the AI was trained with real patient data only (sourced from the public INbreast dataset [6]), performance dropped in general (due to the domain gap between synthetic and real data) but exhibited similar comparative trends overall.
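The AUC computation behind such a comparison, with each retrained model treated as one reader, can be sketched as below. This is an illustrative example only: it uses the rank-based (Mann-Whitney) form of the AUC, the labels and scores are hypothetical toy values, and it does not reproduce the iMRMC variance analysis used to obtain confidence intervals.

```python
import numpy as np


def auc_mann_whitney(labels, scores):
    """AUC as the probability that a mass-present case outscores a mass-absent one."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative case")
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count as half
    return float(wins / (pos.size * neg.size))


def per_reader_auc(labels, scores_by_seed):
    """Compute one AUC per 'reader' (a model retrained with a given seed)."""
    return {seed: auc_mann_whitney(labels, s) for seed, s in scores_by_seed.items()}


# Hypothetical toy data: 1 = mass present, 0 = mass absent.
labels = [0, 0, 1, 1]
scores_by_seed = {
    0: [0.10, 0.40, 0.35, 0.80],
    1: [0.20, 0.30, 0.60, 0.90],
}
reader_aucs = per_reader_auc(labels, scores_by_seed)
mean_auc = float(np.mean(list(reader_aucs.values())))
print(reader_aucs, mean_auc)  # prints: {0: 0.75, 1: 1.0} 0.875
```

Running the same computation separately for each cohort (breast density, mass radius, mass density, relative dose) yields the per-subgroup AUC values whose trends are described above.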


Limitations

Testing with simulations is constrained to the variability captured by the parameter space of the object models for anatomy, pathology, and the acquisition system. Thus, the complexity of the object model and acquisition system may need to be adjusted depending on the complexity of the questions to be investigated with simulated testing. A potential risk of testing using simulated data is missing the variability observed in patient populations.

There is a risk of misjudging model performance due to a domain gap between real and synthetic examples.

Supporting Documentation

The dataset is hosted on Hugging Face (with version control enabled), and the accompanying code is available on GitHub.

Dataset: https://huggingface.co/datasets/didsr/msynth

GitHub: https://github.com/DIDSR/msynth-release

License: Creative Commons Zero 1.0 Universal (CC0 1.0)

  1. Elena Sizikova, Niloufar Saharkhiz, Diksha Sharma, Miguel Lago, Berkman Sahiner, Jana G. Delfino, and Aldo Badano. Knowledge-based in silico models and dataset for the comparative evaluation of mammography AI for a range of breast characteristics, lesion conspicuities and doses. Advances in Neural Information Processing Systems (NeurIPS), 2023.
  2. Aldo Badano, Christian G. Graff, Andreu Badal, Diksha Sharma, Rongping Zeng, Frank W. Samuelson, Stephen J. Glick, and Kyle J. Myers. Evaluation of digital breast tomosynthesis as replacement of full-field digital mammography using an in silico imaging trial. JAMA Network Open, 2018.
  3. Christian G. Graff. A new, open-source, multi-modality digital breast phantom. In Medical Imaging 2016: Physics of Medical Imaging, volume 9783, pages 72–81. SPIE, 2016.
  4. Andreu Badal, Diksha Sharma, Christian G. Graff, Rongping Zeng, and Aldo Badano. Mammography and breast tomosynthesis simulator for virtual clinical trials. Computer Physics Communications, 261:107779, 2021.
  5. Brandon D. Gallas, Andriy Bandos, Frank W. Samuelson, and Robert F. Wagner. A framework for random-effects ROC analysis: Biases with the bootstrap and other variance estimators. Communications in Statistics - Theory and Methods, 38(15):2586–2603, 2009.
  6. Inês C. Moreira, Igor Amaral, Inês Domingues, António Cardoso, Maria Joao Cardoso, and Jaime S. Cardoso. INbreast: toward a full-field digital mammographic database. Academic Radiology, 19(2):236–248, 2012.


Tool Reference

  • In addition to citing relevant publications, please reference the use of this tool using RST24AI06.01.

For more information