Colloquium B Presentations

Date: Wednesday, September 16, 2nd period (11:00-12:30)

Venue: L1

Chair: 畑 秀明
福田 りょう (M, 2nd presentation) 知能コミュニケーション 中村 哲, 渡辺 太郎, 須藤 克仁
title: Pseudo Spoken Language Generation for Machine Translation
abstract: Machine translation of the "spoken language" we routinely use in conversation is difficult because of a lack of parallel data. Most of the available parallel data is "written language" of the kind used when composing text, and very little spoken-language parallel data exists. One possible solution is fine-tuning, which adapts a model from the written-language domain to the spoken-language domain, but the gap between the two styles becomes an obstacle. Although the only formal difference between spoken and written language is the setting in which each is used, their tendencies differ markedly. For example, Japanese spoken language tends to use simpler vocabulary than written language, to have ambiguous sentence boundaries, and to include fillers such as "ええと". These differences are a barrier to domain adaptation from written to spoken language. In this study, we aim to absorb these differences by transforming written language into pseudo-spoken language, enabling effective domain-adaptive learning.
language of the presentation: Japanese
Presentation title: Improving the Accuracy of Spoken-Language Machine Translation via Style Conversion from Written to Spoken Language
Presentation abstract: Machine translation of the "spoken language" we use in everyday conversation is difficult, owing to the scarcity of the parallel data needed for training. Most of the available parallel data is "written language" of the kind used when composing text, and spoken-language parallel data is very scarce. Domain adaptation is a known machine translation technique for such low-resource domains. However, learning spoken-language translation via domain adaptation faces many difficulties. Although the only clear-cut difference between spoken and written language is the setting in which each is used, their tendencies differ greatly. For example, compared with written language, Japanese spoken language tends to use simpler vocabulary, to have ambiguous sentence boundaries, and to contain fillers such as "ええと". These differences are a barrier to domain adaptation from written to spoken language. In this study, we aim to absorb these differences by converting written language into pseudo-spoken language, enabling effective domain-adaptive learning.
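The written-to-pseudo-spoken conversion described above can be illustrated with a minimal sketch. This is an assumption for illustration, not the presenter's actual method: it mimics two of the listed tendencies (fillers such as "ええと" and ambiguous sentence boundaries) with simple string operations; the `FILLERS` list and `filler_rate` parameter are hypothetical choices.

```python
import random

# Hypothetical filler inventory; real spoken Japanese has many more.
FILLERS = ["ええと", "あの", "まあ"]

def to_pseudo_spoken(text: str, filler_rate: float = 0.3, seed: int = 0) -> str:
    """Turn written-style Japanese into rough pseudo-spoken text:
    insert fillers before clauses and blur sentence boundaries."""
    rng = random.Random(seed)
    # Replace sentence-final "。" with "、" so boundaries become ambiguous.
    clauses = [c for c in text.replace("。", "、").split("、") if c]
    out = []
    for clause in clauses:
        if rng.random() < filler_rate:
            out.append(rng.choice(FILLERS))  # prepend a filler
        out.append(clause)
    # Rejoin without sentence-final punctuation.
    return "、".join(out)

print(to_pseudo_spoken("今日は晴れです。明日は雨が降るでしょう。", filler_rate=1.0))
```

A real system would of course learn such a transformation rather than apply fixed rules, but the sketch shows the kind of surface changes the pseudo-spoken data is meant to contain.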
 
安本 玄樹 (M, 2nd presentation) 知能コミュニケーション 中村 哲, 渡辺 太郎, 須藤 克仁
title: When and What Context Does Neural Machine Translation Require?
abstract: Recent developments in neural machine translation have been remarkable. For further improvement, many researchers are building context-aware neural machine translation models. To obtain contextual information, translation models are becoming more complicated. However, when and what context neural machine translation actually requires has not been fully discussed. We analyze this question with a pilot study, in which we translated documents with several different context sentences (a hand-crafted sentence, a sentence selected from the document, or the previous sentence). We will show how context affects the translation.
language of the presentation: Japanese
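The three context-selection strategies compared in the pilot study above can be sketched as interchangeable functions feeding a standard concatenation-style context-aware NMT input. This is a hypothetical setup, not the presenter's exact one; the `<sep>` separator token and function names are assumptions.

```python
from typing import Callable, List

SEP = "<sep>"  # hypothetical separator token between context and source

def previous_sentence(doc: List[str], i: int) -> str:
    """One of the three strategies: use the immediately preceding sentence."""
    return doc[i - 1] if i > 0 else ""

def make_input(doc: List[str], i: int,
               select: Callable[[List[str], int], str]) -> str:
    """Build the model input for sentence i by prepending the selected
    context; a hand-crafted or document-selected sentence would simply be
    another `select` function."""
    context = select(doc, i)
    return f"{context} {SEP} {doc[i]}" if context else doc[i]

doc = ["I met Alice yesterday.", "She was happy."]
print(make_input(doc, 1, previous_sentence))
# I met Alice yesterday. <sep> She was happy.
```

Swapping the `select` argument lets the same translation pipeline be run with each context source, which is what makes the comparison in the pilot study straightforward.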
 
高井 公一 (D, interim presentation) 知能コミュニケーション 中村 哲, 渡辺 太郎, 須藤 克仁, 安田 圭志
title: An Automatic Method to Build Proper-Noun Dictionaries for Translation Systems
abstract: Mistranslation or dropping of proper nouns reduces the quality of machine translation and speech translation output. In this work, we aim to improve machine translation quality by detecting out-of-vocabulary words and automatically installing them into the machine translation system.
There are two main research topics. The first is named-entity recognition (NER) for machine translation systems, where we focus on proper nouns as an out-of-vocabulary problem. In our investigation, when named entities are correctly annotated, the machine translation system produces good translations. We trained an NER system on the JParaCrawl corpus, which contains 25,420 Japanese-English sentence pairs that include proper nouns. Using this data, we carried out several unsupervised NER experiments. New proper nouns were registered in the machine translation system's dictionary based on each NER result, and the translations were evaluated with BLEU. According to the experimental results, one of the methods achieves translation quality comparable to human annotation under automatic evaluation.
The second topic is proper-noun translation: newly detected proper nouns must be translated from the source language so that they can be understood in the target language. For this purpose, we have collected about 700,000 Japanese-English word pairs. This part is ongoing research, so progress and future work will be presented.
language of the presentation: Japanese
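The dictionary-registration step described in the abstract above can be sketched minimally. This is a hypothetical illustration, not the presenter's system: detected proper nouns are stored as source-target pairs and force-replaced in the input before translation, so the MT system neither drops nor mistranslates them.

```python
from typing import Dict

def register(dictionary: Dict[str, str], src: str, tgt: str) -> None:
    """Register a detected proper noun and its target-language form."""
    dictionary[src] = tgt

def apply_dictionary(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace registered proper nouns in the source sentence.
    Longest entries first, so longer names win over their substrings."""
    for src in sorted(dictionary, key=len, reverse=True):
        sentence = sentence.replace(src, dictionary[src])
    return sentence

d: Dict[str, str] = {}
register(d, "奈良先端科学技術大学院大学", "NAIST")
print(apply_dictionary("奈良先端科学技術大学院大学で発表した。", d))
# NAISTで発表した。
```

A production system would integrate the dictionary into decoding (e.g., constrained decoding or placeholder tokens) rather than plain string replacement, but the sketch shows why correct NER output directly improves translations of proper nouns.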