Standardized Evaluation of Large Language Models in Traditional Chinese Medicine

  • Abstract:
    OBJECTIVE To address the lack of evaluation benchmarks for large language models (LLMs) in Traditional Chinese Medicine (TCM), we designed and constructed a TCM benchmark dataset for comprehensively and objectively evaluating how well LLMs master and reason over TCM knowledge, providing a scientific and reliable basis for optimizing LLM performance in the TCM domain.
    METHODS Data were collected from standardized TCM examinations and textbooks to build a benchmark of 29,506 questions covering 13 subjects. Three general-purpose models (GPT-3.5, ChatGLM3, Baichuan) and five Chinese medical models (PULSE, BenTsao, HuatuoGPT2, BianQue2, ShenNong) were evaluated on two tasks: answer prediction and answer reasoning. Results were quantified using accuracy, F1 score, BLEU, and ROUGE.
    RESULTS In the answer prediction task, Baichuan achieved the highest accuracy on single-choice questions (36.07%), while ChatGLM3 achieved the highest accuracy (18.96%) and F1 score (76.31%) on multiple-choice questions. In the answer reasoning task, Baichuan achieved the highest BLEU-1 score (24.71), while ChatGLM3 achieved the highest ROUGE-1 score (44.64).
    CONCLUSION General-purpose LLMs performed slightly better overall than Chinese medical LLMs. No model exceeded 60% accuracy on choice questions, indicating that LLMs still face substantial challenges, and have considerable room for improvement, in the TCM domain.
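    The abstract reports both accuracy and F1 for multiple-choice questions but does not say how either is computed. The sketch below shows one plausible reading, stated as an assumption rather than the authors' method: accuracy counts a question as correct only when the predicted option set matches the gold set exactly, while F1 is computed over the selected options of each question and then averaged. Under that reading, a model that often picks most but not all of the correct options would score low on accuracy and much higher on F1. The function names and the sample data are hypothetical.

```python
from typing import List, Set


def exact_match_accuracy(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    """Fraction of questions whose predicted option set equals the gold set."""
    if not golds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, golds) if p == g)
    return hits / len(golds)


def option_f1(pred: Set[str], gold: Set[str]) -> float:
    """F1 over the options of a single question (assumed definition)."""
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_option_f1(preds: List[Set[str]], golds: List[Set[str]]) -> float:
    """Average of per-question option-level F1."""
    if not golds:
        return 0.0
    return sum(option_f1(p, g) for p, g in zip(preds, golds)) / len(golds)


if __name__ == "__main__":
    # Hypothetical gold answers and model predictions for three questions.
    golds = [{"A", "B", "C"}, {"B", "D"}, {"A"}]
    preds = [{"A", "B"}, {"B", "D"}, {"C"}]
    print(f"accuracy = {exact_match_accuracy(preds, golds):.4f}")
    print(f"mean F1  = {mean_option_f1(preds, golds):.4f}")
```

    In this toy data the first prediction misses one correct option, so it earns no credit under exact match but an F1 of 0.8, which is how partial credit can pull the two metrics apart.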

     
