崔萌
Department of Emergency Medicine, Qingdao Fuwai Cardiovascular Hospital
Automated ICD coding of specific diseases via machine learning has been an active research topic. Although coronary heart disease (CHD) is one of the leading causes of death, it has seldom been studied specifically in this line of research, probably because data targeting the disease are scarce. This study aimed at automated CHD coding with a deep learning method, using Fuwai-CHD, a private dataset from Fuwai Hospital, and MIMIC-III-CHD, the CHD-related subset of the public MIMIC-III dataset. The method consists of three main modules. The first is a BERT variant module that encodes clinical text. In this module, we fine-tuned BERT variants with masked language modeling on clinical text and proposed a truncation method to address the fact that BERT variants generally cannot handle sequences of more than 512 tokens. The second is a word2vec module that encodes code titles, and the third is a label-attention module that integrates the embeddings of clinical text and code titles. We named the method BW_att. We compared BW_att against several widely studied baselines and found that it performed best on most of the coding tasks. Specifically, BW_att reached a Macro-F1 of 96.2% and a Macro-AUC of 98.9% for the top-100 most frequent codes in Fuwai-CHD, which covered 89.2% of the total code occurrences. When predicting the top-50 most frequent codes in MIMIC-III-CHD, BW_att reached a Macro-F1 of 40.5% and a Macro-AUC of 66.1%. Moreover, BW_att was able to locate the tokens in clinical text that were informative for predicting the target codes. In summary, BW_att not only suggests CHD codes accurately but also offers robust interpretability, and hence has great potential to facilitate CHD coding in practice.
Heliyon 2023
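The abstract above does not give the exact form of the label-attention module, so the following is only a minimal sketch of a generic label-wise attention step, assuming dot-product scoring between token embeddings and code-title embeddings. The function names (`label_attention`, `predict`), the toy vectors, and the output-layer weights are all hypothetical and are not taken from the paper. The per-label attention weights are what would make it possible to see which tokens contributed to a given code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def label_attention(token_vecs, label_vecs):
    """For each label, attend over tokens (assumed dot-product scoring).

    Returns one attention distribution and one context vector per label.
    """
    dim = len(token_vecs[0])
    weights, contexts = [], []
    for label in label_vecs:
        # One score per token: similarity between label and token embeddings.
        alpha = softmax([dot(label, tok) for tok in token_vecs])
        # Label-specific document vector: attention-weighted sum of tokens.
        ctx = [sum(a * tok[j] for a, tok in zip(alpha, token_vecs))
               for j in range(dim)]
        weights.append(alpha)
        contexts.append(ctx)
    return weights, contexts


def predict(contexts, out_weights, out_bias):
    """Independent sigmoid per label over a label-specific linear layer."""
    return [1.0 / (1.0 + math.exp(-(dot(c, w) + b)))
            for c, w, b in zip(contexts, out_weights, out_bias)]


# Toy stand-ins: three token embeddings (text encoder output) and
# two code-title embeddings, all hypothetical 2-dimensional vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labels = [[2.0, 0.0], [0.0, 2.0]]

weights, contexts = label_attention(tokens, labels)
probs = predict(contexts, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
```

In this toy case each label places most of its attention on the token whose embedding it most resembles, which is the property an interpretability analysis would inspect.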
BACKGROUND: Computer-assisted clinical coding (CAC) based on automated coding algorithms is expected to improve the quality and productivity of International Classification of Diseases, Tenth Revision (ICD-10) coding, yet studies on automated coding of the primary diagnosis are limited in the Chinese context. OBJECTIVE: This study aims to develop a machine learning (ML) model for automated primary diagnosis ICD-10 coding. METHODS: A total of 71,709 admissions to Fuwai Hospital, corresponding to 168 primary diagnosis ICD-10 codes, were included in this study. Based on clinical implications, two feature engineering methods were used to process discharge diagnosis and procedure texts into sequential features and sequential grouping features, respectively, and the two resulting kinds of models were built and compared. A baseline model using one-hot encoding features was also considered. Light Gradient Boosting Machine (LightGBM) was adopted as the classifier, and grid search with cross-validation was used to select the optimal hyperparameters. SHapley Additive exPlanations (SHAP) values were applied to make the models interpretable. RESULTS: Our best prediction model was built on sequential grouping features. In the test phase it achieved an accuracy of 95.2% and a macro-averaged F1 (Macro-F1) of 88.3%. Comparison of the models demonstrated that both the sequential information and the grouping strategy boosted model performance (P-value < 0.01). Subgroup analysis of the best model on each individual code showed that 91.1% of the codes achieved an F1 over 70.0%. CONCLUSIONS: Our model has demonstrated its effectiveness for automated primary diagnosis coding in the Chinese context, and its results are interpretable. Hence, it has the potential to help clinical coders improve coding efficiency and quality in Chinese inpatient settings.
International Journal of Medical Informatics 2021
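The abstract above reports both accuracy and Macro-F1, and the two can differ noticeably when codes are unevenly distributed. The sketch below shows how the two metrics are computed for single-label prediction, using the standard definitions. The helper names and the toy labels are illustrative only (the ICD-10 codes are used purely as example class names) and the data are not drawn from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of admissions whose predicted code equals the true code."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-code F1, so rare codes count as much as common ones."""
    codes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for code in codes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == code and p == code)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != code and p == code)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == code and p != code)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the code never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Illustrative toy data: one frequent code and one rare code.
y_true = ["I21.0"] * 8 + ["I25.1", "I25.1"]
y_pred = ["I21.0"] * 8 + ["I21.0", "I25.1"]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Here a single error on the rare code costs little accuracy but pulls Macro-F1 down much more, which is why Macro-F1 is the stricter summary when many codes have few admissions.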