DomID: Deep Unsupervised Clustering Algorithms

Catalog of Regulatory Science Tools to Help Assess New Medical Devices 

DomID is a Python package offering a suite of unsupervised deep learning algorithms specifically designed for clustering medical image datasets. The primary goal is to identify subgroups that have not been previously annotated in a given image dataset.

Technical Description

DomID is a Python library that includes five clustering algorithms for medical images. Four are end-to-end deep learning methods; the fifth is a baseline two-step approach that applies a conventional clustering method to the embeddings of a trained neural network.

The package contains software implementations of the following algorithms:

VaDE: Variational Deep Embedding (Jiang et al., 2017) – an unsupervised clustering method based on variational inference that trains a deep neural network to represent data points in a lower-dimensional space while optimizing cluster assignments to group similar data points.

CDVaDE: Conditionally Decoded Variational Deep Embedding (Sidulova et al., 2023) – a modified version of VaDE that uses additional information, such as available image annotations, to guide the clustering towards the identification of previously unannotated image subgroups.

DEC: Deep Embedded Clustering (Xie et al., 2016) – uses a deep neural network to map complex data to lower-dimensional representations and then groups similar data points based on these low-dimensional embeddings. Unlike VaDE and CDVaDE, DEC is not based on variational inference.

SDCN: Structural Deep Clustering Network (Bo et al., 2020) – combines Graph Convolutional Networks (GCNs) and Autoencoders (AEs) for clustering, improving performance when known associations between different images can be encoded in a graph. This package includes a modified SDCN algorithm that uses a novel batching strategy to handle large-scale digital pathology datasets efficiently (Sidulova et al., 2024).

AE+K-means: A two-stage approach where K-means clustering (a conventional clustering algorithm) is applied to the embedding space of a trained AE. This primarily serves as a baseline for performance comparisons.
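
The two-stage baseline can be illustrated with a minimal sketch. It is not DomID's implementation: the `encode` function and its `weights` are hypothetical stand-ins for a trained AE encoder (here just a fixed random linear projection), the image patches are synthetic, and K-means is written out as plain Lloyd's algorithm in NumPy rather than taken from a library.

```python
import numpy as np

def encode(images, weights):
    """Hypothetical stand-in for a trained AE encoder: flatten, then project."""
    flat = images.reshape(len(images), -1)
    return flat @ weights

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(0)
# Synthetic data: 20 dark and 20 bright 8x8 "patches".
images = np.concatenate([
    rng.normal(0.2, 0.05, size=(20, 8, 8)),
    rng.normal(0.8, 0.05, size=(20, 8, 8)),
])
weights = rng.normal(size=(64, 4))            # hypothetical encoder weights

embeddings = encode(images, weights)          # stage 1: embed the images
labels, centroids = kmeans(embeddings, k=2)   # stage 2: cluster the embeddings
```

The point of the sketch is the separation of concerns: the feature extractor is trained (or fixed) first, and the clustering step never feeds back into it. The end-to-end methods above differ in exactly this respect.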

All of these clustering algorithms include a feature extractor component, which can be either an Autoencoder (AE) or a Variational Autoencoder (VAE). The package provides multiple AE and VAE architectures to choose from and includes instructions for extending the package with custom neural network architectures or clustering algorithms.
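
To show how a clustering objective can sit on top of the feature extractor's embeddings, the following sketch implements the soft assignment and target distribution described for DEC by Xie et al. (2016). It follows the published formulas and is not taken from the DomID code base; the function names and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment q[i, j] of embedding i to cluster j."""
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets: p[i, j] proportional to q[i, j]^2 / sum_i q[i, j]."""
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return float(np.sum(p * np.log(p / q)))

# Toy 2-D embeddings and cluster centers (illustrative values only).
z = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])

q = soft_assign(z, centers)
p = target_distribution(q)
loss = kl_divergence(p, q)  # the quantity minimized during DEC refinement
```

In the published method, the gradient of this loss updates both the encoder weights and the cluster centers, so the embedding and the clustering are refined together.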

Ready-to-use experiment tutorials in Jupyter notebooks are available for both the MNIST dataset and a large-scale digital pathology dataset. The code is highly modular, making it easy to use and extend.

Intended Purpose 

This tool is primarily designed for clustering image patches from large-scale digital pathology datasets in an unsupervised fashion (Sidulova et al., 2023, 2024). The implemented methods are also applicable to other types of medical images. This exploratory tool has the goal of identifying meaningful but previously unannotated subgroups within datasets of medical images or image patches. The identified subgroups can be used within training or testing processes for downstream AI/ML models. By identifying previously unrecognized subsets of image datasets, this tool can help evaluate and improve the generalizability and robustness of other downstream AI/ML models, such as classification, object detection, or segmentation models.
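
One way the identified subgroups might be used when testing a downstream model is to report its performance separately for each cluster. The sketch below assumes cluster assignments are already available as a list of integer IDs; the function name and the example labels are hypothetical and not part of DomID.

```python
from collections import defaultdict

def accuracy_by_subgroup(y_true, y_pred, cluster_ids):
    """Return {cluster_id: accuracy} for a downstream classifier."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, cluster in zip(y_true, y_pred, cluster_ids):
        total[cluster] += 1
        correct[cluster] += int(truth == pred)
    return {c: correct[c] / total[c] for c in sorted(total)}

# Illustrative labels, predictions, and cluster assignments.
y_true      = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred      = [1, 0, 1, 1, 1, 0, 0, 1]
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]

per_cluster = accuracy_by_subgroup(y_true, y_pred, cluster_ids)
print(per_cluster)  # {0: 1.0, 1: 0.25}
```

A large gap between clusters, as in this toy example, would suggest that the downstream model generalizes poorly to one of the discovered subgroups.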

Testing

The implemented algorithms have been validated on multiple image datasets in a series of peer-reviewed publications.

Sidulova et al. (2023) evaluate and compare the performance characteristics of DEC, VaDE, and CDVaDE on a HER2 digital pathology dataset. The results show that these methods can identify meaningful image subgroups, with CDVaDE particularly effective in identifying subgroups not associated with already known labels. The study also includes specifically designed experiments on a modified version of the MNIST dataset of handwritten digits, providing a controlled setting to demonstrate the models' capabilities and performance at a more granular level.

Sidulova et al. (2024) investigate the application of the deep clustering algorithms AE+K-means, DEC, and (modified) SDCN to digital pathology datasets, specifically endometrial biopsy whole-slide images (WSIs). Both DEC and SDCN successfully identified meaningful clusters of WSI patches, and SDCN, which incorporates spatial contextual information, demonstrated superior performance on the evaluated metrics.

Additionally, the DomID software package includes automated tests for all implemented algorithms, written with the pytest testing framework. These tests can be run automatically whenever changes are made to the code, catching errors early and reducing the risk of introducing bugs or regressions.
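
For readers unfamiliar with pytest, the following sketch shows the general shape of such tests. The function under test, `assign_clusters`, is a hypothetical nearest-centroid helper written for this example. It does not reproduce any test or function from the DomID repository.

```python
import numpy as np

def assign_clusters(embeddings, centroids):
    """Hypothetical function under test: nearest-centroid assignment."""
    dists = ((embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def test_assignment_has_one_label_per_sample():
    embeddings = np.zeros((5, 3))
    centroids = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
    labels = assign_clusters(embeddings, centroids)
    assert labels.shape == (5,)

def test_points_go_to_nearest_centroid():
    embeddings = np.array([[0.1, 0.0], [0.9, 1.0]])
    centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
    assert assign_clusters(embeddings, centroids).tolist() == [0, 1]
```

Saved in a file whose name starts with `test_`, these functions are discovered and run by invoking `pytest` in the same directory. Each plain `assert` that fails is reported as a test failure.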

Limitations

While the implemented algorithms can be used as an exploratory tool to uncover unannotated subgroups in a given dataset, developing specialized quantitative evaluation metrics for this unsupervised task is inherently difficult. In the published experiments, the metrics used to evaluate the clustering results were task- and dataset-specific. Therefore, although we expect reasonable performance on other tasks or image datasets, it cannot be guaranteed in all cases.

In the SDCN algorithm, the knowledge graph was based on specific domain knowledge extracted from WSIs. Adopting different types of inter-patch relationship information could alter the graph's structure, potentially necessitating a different batching approach than that implemented in DomID.
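
As an illustration of how inter-patch relationships can be encoded in a graph, the sketch below links each patch to its k spatially nearest patches. This construction is an assumption made for the example: this page does not specify how DomID builds its graph, and the function name and coordinates are hypothetical.

```python
import numpy as np

def knn_adjacency(coords, k):
    """Symmetric 0/1 adjacency linking each patch to its k nearest patches."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)            # a patch is not its own neighbour
    neighbours = np.argsort(sq_dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1
    return np.maximum(adj, adj.T)                # make the graph undirected

# Illustrative (row, column) positions of four patches on a slide.
patch_coords = [[0, 0], [0, 1], [5, 5], [5, 6]]
adjacency = knn_adjacency(patch_coords, k=1)
```

A different relationship, for example linking patches that come from the same patient rather than from the same neighbourhood, would produce a graph with a very different degree distribution. This is why the batching strategy may need to change along with the graph.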

Supporting Documentation

Tool website:

Documentation / User manual:

Tutorials and case studies in the form of Jupyter notebooks: https://github.com/DIDSR/DomId/tree/main/notebooks

References:

Bo, D., Wang, X., Shi, C., Zhu, M., Lu, E., & Cui, P. (2020). Structural Deep Clustering Network. Proceedings of The Web Conference 2020, 1400–1410. https://doi.org/10.1145/3366423.3380214

Jiang, Z., Zheng, Y., Tan, H., Tang, B., & Zhou, H. (2017). Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering. Proceedings of the 26th International Joint Conference on Artificial Intelligence, 1965–1972.

Sidulova, M., Kahaki, S. M. M., Hagemann, I. S., & Gossmann, A. (2024, in print). Contextual unsupervised deep clustering in digital pathology. Proceedings of the Conference on Health, Inference, and Learning.

Sidulova, M., Sun, X., & Gossmann, A. (2023). Deep Unsupervised Clustering for Conditional Identification of Subgroups Within a Digital Pathology Image Set. In H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood, & R. Taylor (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 (Vol. 14227, pp. 666–675). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-43993-3_64

Xie, J., Girshick, R., & Farhadi, A. (2016). Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of The 33rd International Conference on Machine Learning (Vol. 48, pp. 478–487). PMLR.

Contact

Tool Reference