Converting Noisy or Long Sentences into Readable Sentences

Itsumi Takahashi


In this thesis, we propose methods for converting noisy sentences and long documents into comprehensible sentences. We tackle three main tasks.

The first task is the joint estimation of word-level normalization and morphological analysis. We propose a novel method for analyzing non-standard tokens and achieve higher accuracy and recall for word segmentation, POS tagging, and normalization than conventional systems. In addition, we propose a method for automatically extracting pairs of a variant word and its standard form from unannotated text. Incorporating the acquired variant-normalization pairs into Japanese morphological analysis improves the recall of normalization.
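The abstract does not detail how variant-normalization pairs are harvested; as one illustrative sketch (not the thesis's actual algorithm), a common heuristic is to pair out-of-vocabulary tokens with their closest standard form by character-level similarity:

```python
# Hypothetical sketch: harvest (variant, standard) pairs from raw text
# by edit similarity. All function names and the threshold are
# illustrative assumptions, not the method proposed in the thesis.
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def extract_pairs(corpus_tokens, standard_vocab, threshold=0.6):
    """Return (variant, standard) candidates for out-of-vocabulary tokens."""
    pairs = []
    vocab = set(standard_vocab)
    for tok in corpus_tokens:
        if tok in vocab:
            continue  # already a standard form, nothing to normalize
        # pick the closest standard form above the similarity threshold
        best = max(vocab, key=lambda w: edit_similarity(tok, w))
        if edit_similarity(tok, best) >= threshold:
            pairs.append((tok, best))
    return pairs

print(extract_pairs(["cooool", "sooo", "cat"], ["cool", "so", "cat"]))
```

Real systems refine such candidates with context or alignment models, since pure edit distance also pairs unrelated but similar-looking words.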

The second task is sentence-level normalization. We propose simple but effective data augmentation methods for encoder-decoder-based neural normalization models and achieve higher BLEU scores than the baselines.
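One common way to augment training data for a neural normalizer is to synthesize (noisy, clean) pairs by perturbing clean sentences at the character level. The sketch below assumes this noising approach for illustration; the thesis's actual augmentation methods may differ:

```python
# Hypothetical sketch: generate synthetic (noisy, clean) pairs for
# training an encoder-decoder normalization model. The perturbation
# types and probabilities are illustrative assumptions.
import random

def noisify(sentence: str, rng: random.Random, p: float = 0.1) -> str:
    """Perturb characters: elongate ("so" -> "sooo") or drop, each with probability p."""
    out = []
    for c in sentence:
        r = rng.random()
        if r < p:                      # elongate the character
            out.append(c * rng.randint(2, 4))
        elif r < 2 * p and c != " ":   # drop the character
            continue
        else:
            out.append(c)
    return "".join(out)

def augment(clean_sentences, n_copies=2, seed=0):
    """Build (noisy, clean) pairs for seq2seq normalization training."""
    rng = random.Random(seed)
    return [(noisify(s, rng), s) for s in clean_sentences for _ in range(n_copies)]

for noisy, clean in augment(["see you tomorrow"], n_copies=2):
    print(noisy, "->", clean)
```

Because each clean sentence yields several distinct noisy renditions, the model sees many surface variants mapping to the same target, which is the core idea behind this style of augmentation.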

The third task is document summarization. We propose a new prototype-guided, length-controllable abstractive summarization model that generates comprehensible text from a long document. The model first extracts a prototype text and then creates a summary by jointly encoding and copying words from both the prototype and the source text. Experiments on the CNN/Daily Mail and NEWSROOM datasets show that our model outperforms previous models in both standard and length-controlled settings.