26 Evaluating AI models trained with varying amounts of expert feedback for chronic graft-versus-host disease skin assessment in photos of patients with diverse skin tones
Published online by Cambridge University Press: 11 April 2025
Abstract
Objectives/Goals: Manual skin assessment in chronic graft-versus-host disease (cGVHD) can be time-consuming and inconsistent (>20% affected area), even for experts. Building on previous work, we explore methods that use unmarked photos to train artificial intelligence (AI) models, aiming to improve performance by expanding and diversifying the training data without additional burden on experts.
Methods/Study Population: As in many medical imaging projects, we have a small number of expert-marked patient photos (N = 36 patients, n = 360 photos) and many unmarked photos (N = 337 patients, n = 25,842 photos). Dark skin (Fitzpatrick type 4+) is underrepresented in both sets: 11% of patients in the marked set and 9% in the unmarked set. In addition, a set of 20 expert-marked photos from 20 patients, 20% with dark skin types, was withheld from training to assess model performance. Our gold standard markings were manual contours drawn around affected skin by a trained expert. Three AI training methods were tested. Our established baseline uses only the small number of marked photos (supervised method). The semi-supervised method uses a mix of marked and unmarked photos with human feedback. The self-supervised method uses only unmarked photos without any human feedback.
Results/Anticipated Results: We evaluated performance by comparing predicted skin areas with expert markings. The error was defined as the absolute difference between the percentage areas marked by the AI model and by the expert, where lower is better. Across all test patients, the median error was 19% (interquartile range 6–34) for the supervised method and 10% (5–23) for the semi-supervised method, which incorporated unmarked photos from 83 patients. On dark skin types, the median error was 36% (18–62) for the supervised method and 28% (14–52) for the semi-supervised method, compared with a median error on light skin of 18% (5–26) for supervised and 7% (4–17) for semi-supervised. The self-supervised method, using unmarked photos from all 337 patients, is expected to further improve performance and consistency due to increased data diversity. Full results will be presented at the meeting.
Discussion/Significance of Impact: By automating skin assessment for cGVHD, AI could improve accuracy and consistency compared with manual methods. If translated to clinical use, this would ease clinical burden and scale to large patient cohorts. Future work will focus on ensuring equitable performance across all skin types, providing fair and accurate assessments for every patient.
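The error metric described in the Results can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`percent_area`, `area_error`, `summarize`) are hypothetical, and the sketch assumes that each marking is available as a binary mask and that "percentage area" is taken relative to all pixels in the photo, which the abstract does not specify.

```python
import numpy as np


def percent_area(mask):
    """Percentage of pixels marked as affected.

    Assumption: the denominator is the whole image; the abstract does not
    state whether percentages are relative to the image or to visible skin.
    """
    mask = np.asarray(mask, dtype=bool)
    return 100.0 * mask.sum() / mask.size


def area_error(pred_mask, expert_mask):
    """Absolute difference between AI and expert percentage areas (lower is better)."""
    return abs(percent_area(pred_mask) - percent_area(expert_mask))


def summarize(errors):
    """Median and interquartile range (Q1, Q3) of per-photo errors."""
    q1, median, q3 = np.percentile(np.asarray(errors, dtype=float), [25, 50, 75])
    return median, (q1, q3)


if __name__ == "__main__":
    # Toy 10 x 10 masks, purely illustrative (not study data).
    expert = np.zeros((10, 10), dtype=bool)
    expert[:3, :] = True   # expert marks 30 of 100 pixels
    pred = np.zeros((10, 10), dtype=bool)
    pred[:2, :] = True     # model marks 20 of 100 pixels
    print(area_error(pred, expert))
    print(summarize([1, 2, 3, 4, 5]))
```

Because this metric compares only the total marked area, two markings of equal size in different locations would yield zero error; an overlap-based measure would be needed to capture location agreement.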
- Type: Informatics, AI and Data Science
- Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is unaltered and is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use or in order to create a derivative work.
- Copyright: © The Author(s), 2025. The Association for Clinical and Translational Science