Study on TCM Constitution Identification Based on Multi-Level Feature Fusion of TCM Tongue Images


    Abstract:
    OBJECTIVE To integrate multimodal features from tongue images and textual descriptions and to construct a hierarchically fused deep learning model for Traditional Chinese Medicine (TCM) constitution identification.
    METHODS Corresponding tongue diagnosis texts were generated using a large pre-trained language model, forming a multimodal dataset of 945 samples. The proposed TCM-DFM model employed ResNet50 to extract image features and BERT to encode text semantics. A gating mechanism performed visual-semantic adaptive weighting in the low-dimensional feature space, cross-modal attention established pathological feature associations in the high-dimensional semantic space, and a dynamic decision fusion mechanism integrated unimodal and multimodal predictions. On a dataset containing six TCM constitution labels, the model was compared with baseline methods such as early fusion and late fusion, and performance was evaluated with accuracy, precision, recall, F1 score, and the confusion matrix.
    RESULTS The TCM-DFM model achieved an accuracy of 84.52%, precision of 82.54%, recall of 84.52%, and F1-score of 83.39%, outperforming all baseline models on every metric. Among the multimodal fusion methods compared, GCAF reached 83.33% accuracy, 23.81 percentage points higher than the best unimodal model. Ablation experiments confirmed the synergistic contribution of the gating mechanism and the attention module. Visualization analysis showed that the model concentrated on key regions of the tongue contour, consistent with the TCM diagnostic principle of inspecting tongue shape.
    CONCLUSION The proposed model effectively integrates information from tongue images and textual descriptions, overcoming limitations of unimodal analysis and conventional fusion methods. It significantly improves the accuracy of constitution classification and underscores the essential role of tongue diagnosis in TCM constitution identification.
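The three fusion stages described in METHODS can be illustrated with a minimal NumPy sketch. All dimensions, weight matrices, and branch weights below are toy assumptions for exposition, not the paper's actual TCM-DFM implementation; the real model learns these parameters end-to-end on ResNet50 and BERT features.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, n_patch, n_tok, n_cls = 8, 4, 6, 6  # toy sizes; real dimensions are not given in the abstract

# Random stand-ins for ResNet50 patch features and BERT token features.
V = rng.standard_normal((n_patch, d))  # visual features
T = rng.standard_normal((n_tok, d))    # textual features

# (1) Gated fusion of pooled low-level features:
#     g = sigmoid(W[v; t]), fused = g * v + (1 - g) * t
v, t = V.mean(axis=0), T.mean(axis=0)
Wg = 0.1 * rng.standard_normal((d, 2 * d))
g = sigmoid(Wg @ np.concatenate([v, t]))
low_fused = g * v + (1 - g) * t        # adaptive visual-semantic weighting

# (2) Cross-modal attention in the high-level space: image patches
#     query text tokens (single head, identity projections for brevity).
attn = softmax(V @ T.T / np.sqrt(d), axis=-1)  # (n_patch, n_tok)
high_fused = (attn @ T).mean(axis=0)           # text-informed visual summary

# (3) Dynamic decision fusion: combine unimodal and multimodal logits
#     with softmax-normalised branch weights (learned in practice).
heads = {name: 0.1 * rng.standard_normal((n_cls, d)) for name in ("img", "txt", "fus")}
logits = {
    "img": heads["img"] @ v,
    "txt": heads["txt"] @ t,
    "fus": heads["fus"] @ (low_fused + high_fused),
}
branch_w = softmax(np.array([0.5, 0.2, 1.0]))  # fixed here for illustration
final = sum(w * logits[k] for w, k in zip(branch_w, logits))
probs = softmax(final)  # probabilities over the six constitution classes
```

The gate `g` lets each feature dimension lean toward whichever modality is more informative, while the attention map ties individual tongue regions to the text tokens describing them; the decision stage keeps the unimodal branches as a fallback when one modality is unreliable.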

