A Study on Clinical Natural Language Processing Covering Clinical Texts and Patient Narratives

柴田大作


In the field of clinical natural language processing (NLP), two types of natural language data are used for research. The first is electronic medical record (EMR) data, which include several types of unstructured text, such as medical records, nursing records, and radiology reports. The second is patient narrative data, for example, dialogues between patients and medical staff or conversations with a patient's family. EMR data are collected in hospitals on a daily basis and are utilized for clinical research and for the development of medical artificial intelligence. Narrative data, in turn, can be utilized for preventive medicine, for example, health promotion and the early detection and treatment of diseases. However, both data types pose specific problems: EMR data are generally written as unstructured free text, and no established methods exist for utilizing them; narrative data, although useful for preventive medicine, are not readily available in Japan. Therefore, in this study, we propose three approaches to solve these problems.

The first approach is to develop methods for extracting pain information from clinical texts. Because pain reported by patients cannot be obtained from test values, medical staff record it as free text in clinical documents based on dialogue with patients. Extracting pain expressions from such texts with rule-based algorithms is therefore challenging. We proposed an annotation scheme for extracting pain information from clinical texts using deep learning. We showed that a deep learning model trained on our annotated data can extract pain expressions from clinical records with high accuracy.
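To illustrate the general idea of span annotation for sequence labeling, the sketch below converts annotated character spans into token-level BIO tags, the usual input format for a deep learning extractor. The label name, tokenizer, and example sentence are illustrative assumptions, not the thesis's actual annotation scheme (which targets Japanese text).

```python
# Illustrative sketch: mapping annotated character spans onto tokens as
# BIO tags for training a sequence-labeling model. The "PAIN" label and
# whitespace tokenizer are hypothetical simplifications.

def tokenize_with_offsets(text):
    """Naive whitespace tokenizer that keeps (token, start, end) offsets."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((tok, start, start + len(tok)))
        pos = start + len(tok)
    return tokens


def spans_to_bio(tokens, spans, label="PAIN"):
    """Assign B-/I-/O tags to tokens from annotated character spans."""
    tags = []
    for tok, start, end in tokens:
        tag = "O"
        for s, e in spans:
            if start >= s and end <= e:  # token lies inside a span
                tag = ("B-" if start == s else "I-") + label
                break
        tags.append(tag)
    return tags


text = "Patient reports sharp chest pain since morning"
tokens = tokenize_with_offsets(text)
spans = [(16, 32)]  # annotated span covering "sharp chest pain"
print(spans_to_bio(tokens, spans))
# → ['O', 'O', 'B-PAIN', 'I-PAIN', 'I-PAIN', 'O', 'O']
```

A model such as a BiLSTM-CRF or a BERT-style tagger can then be trained on token sequences paired with these tags.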

The second approach is to develop a new clinical NLP system, Document-to-Table (Doc2Table), which converts Japanese radiology reports into a structured format. In the clinical field, radiology reports written by trained radiologists contain large quantities of patient information with considerable potential for primary and secondary use in research. However, these reports are unstructured free text, and manual extraction of information from them is laborious, expensive, and time-consuming. There is therefore a need for systems that apply NLP techniques to automatically identify and extract important information from unstructured radiology reports and provide structured output. In this study, we describe the development of Doc2Table and its application to extracting and tabulating information from unstructured Japanese radiology reports on lung cancer. We showed that Doc2Table can convert Japanese radiology reports into a tabular format with consistent accuracy.
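The core report-to-table task can be sketched as mapping free text onto a fixed set of fields. The minimal example below uses regular expressions over English text purely for illustration; the field names and patterns are hypothetical and do not reflect Doc2Table's actual pipeline, which processes Japanese radiology reports.

```python
import re

# Minimal sketch of structuring a free-text report into one table row.
# Field names and patterns are hypothetical English-language examples.
FIELDS = {
    "location": re.compile(r"\b(?:right|left)\s+(?:upper|middle|lower)\s+lobe\b", re.I),
    "size":     re.compile(r"\b\d+(?:\.\d+)?\s*mm\b"),
    "finding":  re.compile(r"\b(?:nodule|mass|opacity)\b", re.I),
}


def report_to_row(report):
    """Extract one structured row (dict) from a free-text report."""
    row = {}
    for field, pattern in FIELDS.items():
        match = pattern.search(report)
        row[field] = match.group(0) if match else None
    return row


report = "A 12 mm nodule is seen in the right upper lobe."
print(report_to_row(report))
# → {'location': 'right upper lobe', 'size': '12 mm', 'finding': 'nodule'}
```

A full system would replace the regular expressions with learned extraction models and normalize values (e.g., units, laterality) before tabulation, but the input/output contract is the same: one unstructured report in, one structured row out.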

The third approach is to construct a Japanese narrative corpus and to develop machine learning methods for detecting dementia in its early stages. We constructed a Japanese narrative speech corpus consisting of narratives from patients with mild cognitive impairment (MCI) and healthy elderly (HE) speakers, and automatically classified the narratives as MCI or HE. We achieved consistent performance using linguistic features. These results indicate that patients in the early stages of dementia can be identified with consistent performance using the proposed methods.
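The classification setup can be sketched as extracting linguistic features from each transcript and fitting a simple classifier. The two toy features (type-token ratio and mean word length), the nearest-centroid classifier, and the English toy data below are illustrative assumptions only; they are not the corpus, feature set, or model from this study.

```python
# Illustrative sketch: classifying transcribed narratives as MCI or HE
# from simple linguistic features. Features, classifier, and data are
# hypothetical placeholders, not the thesis's actual method.

def linguistic_features(transcript):
    """Two toy features: type-token ratio and mean word length."""
    words = transcript.lower().split()
    ttr = len(set(words)) / len(words)
    mean_len = sum(len(w) for w in words) / len(words)
    return [ttr, mean_len]


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_predict(x, centroids):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: dist(x, centroids[label]))


# Toy training transcripts: repetitive speech vs. lexically varied speech.
train = {
    "MCI": ["the the thing the thing it it was was",
            "it was it was the the thing"],
    "HE":  ["yesterday morning I walked along the riverside with my neighbour",
            "we discussed the harvest festival preparations in detail"],
}
centroids = {label: centroid([linguistic_features(t) for t in texts])
             for label, texts in train.items()}

query = linguistic_features("it it was the the thing thing")
print(nearest_centroid_predict(query, centroids))
# → MCI
```

In practice one would use many more linguistic features (lexical, syntactic, and acoustic) and a stronger classifier with cross-validation, but the pipeline shape (transcript, feature vector, label) is the same.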