
Repeatability, Reproducibility, and Diagnostic Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage Using the Simple Triage and Rapid Treatment (START) Protocol

Published online by Cambridge University Press:  31 October 2024

Jeffrey Michael Franc
Affiliation:
University of Alberta, Edmonton, AB, Canada; Universita’ del Piemonte Orientale, Novara, NO, Italy
Atilla Hertelendy
Affiliation:
Beth Israel Deaconess Medical Center, Boston, MA, USA
Lenard Cheng
Affiliation:
Beth Israel Deaconess Medical Center, Boston, MA, USA
Ryan Hata
Affiliation:
Beth Israel Deaconess Medical Center, Boston, MA, USA
Manuela Verde
Affiliation:
Universita’ del Piemonte Orientale, Novara, NO, Italy

Abstract

Objective

The release of ChatGPT in November 2022 drastically lowered the barrier to using artificial intelligence by providing an intuitive web-based interface to a large language model. This study addressed the research question: “Can ChatGPT adequately triage simulated disaster patients using the Simple Triage and Rapid Treatment (START) tool?”

Methods

Five trained disaster medicine physicians developed nine prompts. A Python script queried ChatGPT Version 4 with each prompt combined with each of 391 validated patient vignettes. Each prompt-vignette combination was repeated ten times, for a total of 35,190 simulated triages (9 prompts × 391 vignettes × 10 repetitions).
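As a rough illustration of the size of this design, the sketch below enumerates every prompt, vignette, and repetition combination. It is not the authors' script: the function name `build_design` and the integer identifiers are hypothetical, and no model is queried.

```python
from itertools import product

# Design sizes as reported in the Methods section.
N_PROMPTS = 9
N_VIGNETTES = 391
N_REPETITIONS = 10


def build_design(n_prompts=N_PROMPTS, n_vignettes=N_VIGNETTES,
                 n_reps=N_REPETITIONS):
    """Return every (prompt_id, vignette_id, repetition) combination.

    The integer ids are placeholders (an assumption for illustration);
    a real script would map them to prompt text and vignette text.
    """
    return list(product(range(n_prompts), range(n_vignettes), range(n_reps)))


design = build_design()
print(len(design))  # 9 * 391 * 10 = 35190
```

Enumerating the full design up front makes it straightforward to confirm that every combination is queried the intended number of times.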

Results

A valid START score was returned in 35,102 queries (99.7%). There was considerable variability in the results. Repeatability (use of the same prompt repeatedly) was responsible for 14.0% of overall variation. Reproducibility (use of different prompts) was responsible for 4.1% of overall variation. Accuracy of ChatGPT for START was 61.4% with a 5.0% under-triage rate and a 33.6% over-triage rate. Accuracy varied by prompt between 45.8% and 68.6%.
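A minimal sketch of how accuracy, under-triage, and over-triage rates of this kind might be tallied is shown below. Several details are assumptions rather than the study's method: the acuity ordering in `ACUITY` (including where black sits), the helper names `classify` and `triage_rates`, the use of valid responses as the denominator, and the toy data.

```python
from collections import Counter

# Assumed acuity ranking for illustration only; the abstract does not
# state how START categories were ordered.
ACUITY = {"green": 0, "yellow": 1, "red": 2, "black": 3}


def classify(assigned, reference):
    """Label one triage decision relative to the reference category."""
    if assigned not in ACUITY:
        return "invalid"  # the model did not return a usable START category
    if ACUITY[assigned] == ACUITY[reference]:
        return "correct"
    return "over" if ACUITY[assigned] > ACUITY[reference] else "under"


def triage_rates(pairs):
    """Return correct/over/under proportions among valid responses."""
    counts = Counter(classify(a, r) for a, r in pairs)
    valid = counts["correct"] + counts["over"] + counts["under"]
    if valid == 0:
        return {"correct": 0.0, "over": 0.0, "under": 0.0}
    return {k: counts[k] / valid for k in ("correct", "over", "under")}


# Toy (assigned, reference) pairs, not study data.
toy = [("red", "red"), ("red", "yellow"), ("green", "yellow"),
       ("yellow", "yellow"), ("n/a", "red")]
print(triage_rates(toy))
```

Under this scheme the three proportions sum to one, which is consistent with the reported figures (61.4% + 5.0% + 33.6% = 100.0%).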

Conclusions

This study suggests that the current ChatGPT large language model is not sufficient for triage of simulated patients using START due to poor repeatability and accuracy. Medical practitioners should be aware that while ChatGPT can be a valuable tool, it may lack consistency and may provide false information.

Type
Abstract
Copyright
© The Author(s), 2024. Published by Cambridge University Press on behalf of Society for Disaster Medicine and Public Health, Inc.
Supplementary material: File

Franc et al. supplementary material
