Abstract:
OBJECTIVE To address the current lack of benchmarks for evaluating large language models (LLMs) in traditional Chinese medicine (TCM), a TCM benchmark dataset was designed and constructed to comprehensively and objectively evaluate the knowledge mastery and reasoning performance of LLMs in TCM, providing a scientific and reliable basis for optimizing LLM performance in this field.
METHODS The benchmark comprises 29 506 questions across 13 subjects, collected from standardized TCM examinations and textbooks. Three general-purpose models (GPT-3.5, ChatGLM3, Baichuan) and five Chinese medical models (PULSE, BenTsao, HuatuoGPT2, BianQue2, ShenNong) were evaluated on answer prediction and answer reasoning tasks. Results were quantitatively assessed with accuracy, F1 score, BLEU, and ROUGE.
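As an illustration of the scoring just described, the sketch below computes accuracy and micro-F1 over sets of option letters for choice questions, and unigram BLEU-1/ROUGE-1 for free-text reasoning answers. It is a minimal, self-contained approximation: the set-based F1, character-level tokenization, and all function names are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
# Minimal scoring sketch; field names and tokenization are illustrative assumptions.
from collections import Counter
import math

def choice_scores(pred_sets, gold_sets):
    """Accuracy and micro-F1 for choice answers given as sets of option letters."""
    exact = sum(p == g for p, g in zip(pred_sets, gold_sets))
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))
    fp = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))
    fn = sum(len(g - p) for p, g in zip(pred_sets, gold_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return exact / len(gold_sets), f1

def bleu1(candidate, reference):
    """Unigram BLEU with brevity penalty; tokens are pre-split (e.g. per character for Chinese)."""
    if not candidate:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(n, ref[tok]) for tok, n in cand.items())
    precision = overlap / len(candidate)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * precision

def rouge1(candidate, reference):
    """Unigram ROUGE, i.e. recall of reference tokens in the candidate."""
    if not reference:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(n, cand[tok]) for tok, n in ref.items())
    return overlap / len(reference)

# Example: one multiple-choice item and one character-tokenized reasoning answer.
acc, f1 = choice_scores([{"A", "C"}], [{"A", "C", "D"}])
b = bleu1(list("气血两虚证宜补气养血"), list("气血两虚证应以补气养血为治法"))
r = rouge1(list("气血两虚证宜补气养血"), list("气血两虚证应以补气养血为治法"))
print(acc, round(f1, 4), round(b, 4), round(r, 4))
```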
RESULTS In the answer prediction task, Baichuan achieved the highest accuracy on single-choice questions (36.07%), while ChatGLM3 achieved the highest accuracy (18.96%) and F1 score (76.31%) on multiple-choice questions. In the answer reasoning task, Baichuan scored highest on BLEU-1 (24.71), while ChatGLM3 achieved the highest ROUGE-1 score (44.64).
CONCLUSION In this study, general-purpose LLMs performed slightly better than Chinese medical LLMs. However, the accuracy of every model on the choice questions remained below 60%, reflecting the significant challenges LLMs still face, and the substantial room for improvement, in the field of TCM.