Electronic Medical Records (EMRs) have been widely adopted to record patients' medical progress. They have improved clinical documentation and decision support because they provide coordinated, quick, and efficient access to patient records. However, the electronic format of EMRs encourages copy-and-paste and the use of templates. Copy-and-paste introduces redundancy, which reduces the quality of EMR data and makes it difficult to extract relevant information for decision making. Therefore, there is a need to minimize redundancy so as to improve the quality of collected EMR data and make clinical decision making easier and more efficient. One method for detecting redundant information is to compute the degree of semantic equivalence between clinical texts and remove texts that are highly equivalent. Semantic textual similarity (STS) tasks have been widely studied in the general English domain; however, there exist very few resources for STS tasks in the clinical domain and in low-resource languages.
In this study, we focus on the problem of detecting redundancy in English and Japanese clinical texts. For English texts, we use the data provided in the 2019 n2c2/OHNLP Clinical STS shared task. For Japanese texts, we create a publicly available dataset for Japanese clinical STS. We also investigate the level of redundancy in Japanese medical documents. In our corpus, we find an average redundancy level of about 36%, which is close to the redundancy level found in a French corpus (33%) and much lower than that found in an American corpus (up to 78%). Moreover, we solve a Clinical STS task: given a pair of sentences, the objective is to compute their degree of semantic similarity on a scale from 0 to 5. A score of 0 means that the two sentences are completely dissimilar, i.e., their meanings do not overlap, and a score of 5 means that the two sentences are semantically equivalent. We adopt a BERT-based model since BERT has achieved state-of-the-art performance in various NLP tasks. Our experimental results show that the model can effectively detect redundancy in medical documents. We achieve a high Pearson correlation score of 0.882 on the English corpus, and 0.904 and 0.875 on two Japanese corpora.
Keywords: Natural Language Processing (NLP), Semantic Textual Similarity (STS), Electronic Medical Records (EMR), Bidirectional Encoder Representations from Transformers (BERT)
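The evaluation described in the abstract, in which predicted similarity scores on the 0 to 5 scale are compared with gold annotations using the Pearson correlation, can be illustrated with a minimal Python sketch. The function names `pearson` and `flag_redundant`, the example scores, and the threshold of 4.0 are assumptions made for illustration only; they are not taken from the study, which does not state a redundancy threshold here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def flag_redundant(scored_pairs, threshold=4.0):
    """Return sentence pairs whose similarity score meets the threshold.

    The threshold is a hypothetical choice for this sketch, not a value
    reported in the study.
    """
    return [(a, b) for a, b, score in scored_pairs if score >= threshold]


# Hypothetical gold annotations and model predictions on the 0-5 scale.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
predicted = [0.4, 1.2, 3.3, 4.1, 4.9]
print(f"Pearson r = {pearson(gold, predicted):.3f}")

# Hypothetical scored sentence pairs; only the first exceeds the threshold.
pairs = [
    ("Patient denies chest pain.", "The patient reports no chest pain.", 4.6),
    ("Patient denies chest pain.", "Follow up in two weeks.", 0.3),
]
print(flag_redundant(pairs))
```

In practice, the similarity scores would come from the STS model rather than being entered by hand, and any cut-off for treating a pair as redundant would need to be chosen and validated on annotated data.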