Colloquium B Presentations

Date and time: June 18 (Tue), 3rd period (13:30-15:00)

Venue: L1

Chair: Kan Yirong
ALI IQRA M, 2nd presentation, Natural Language Processing (渡辺 太郎, 荒牧 英治, 上垣外 英剛)
title: Monolingual Paraphrase Detection for Low Resource Pashto Language at Sentence Level
abstract: Paraphrase detection is the task of identifying whether two sentences are semantically similar. It plays an important role in maintaining the integrity of written work, for example in plagiarism detection and text reuse detection. Previous research has focused on developing large corpora for English; however, no work has addressed sentence-level paraphrase detection in the low-resource Pashto language. To bridge this gap, we introduce the first fully manually annotated Pashto sentential paraphrase detection corpus, collected from authentic cases in journalism and covering 10 domains, including Sports, Health, and Environment. Our proposed corpus contains 6,727 sentence pairs: 3,687 paraphrased and 3,040 non-paraphrased. Experimental findings reveal that the corpus is sufficient to train XLM-RoBERTa to accurately detect paraphrased sentence pairs in Pashto, with an F1 score of 84%. To compare our corpus with those in other languages, we also applied our fine-tuned model to Indonesian and English paraphrase datasets in a zero-shot manner, achieving F1 scores of 82% and 78%, respectively.
language of the presentation: English
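As a concrete illustration of the classification setup described in the abstract above, here is a minimal sketch of sentence-pair inference with XLM-RoBERTa using the HuggingFace transformers API. The checkpoint name, label mapping, and argmax decision rule are assumptions for illustration; the authors' fine-tuned weights are not part of this announcement.

```python
# Minimal sketch: scoring a sentence pair with an XLM-RoBERTa classifier.
# "xlm-roberta-base" is a stand-in for the authors' fine-tuned model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "xlm-roberta-base"  # hypothetical; substitute the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def is_paraphrase(sent_a: str, sent_b: str) -> bool:
    # Encode the two sentences jointly as one classifier input.
    inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes label 1 is the "paraphrase" class.
    return logits.argmax(dim=-1).item() == 1
```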
 
FREDERIKUS HUDI M, 2nd presentation, Natural Language Processing (渡辺 太郎, 荒牧 英治, 上垣外 英剛)
title: Disentangling Pretrained Representation to Leverage Low-Resource Languages in Multilingual Machine Translation
abstract: Multilingual neural machine translation aims to encapsulate multiple languages in a single model. However, it requires enormous datasets, leaving low-resource languages (LRLs) underdeveloped. As LRLs may benefit from the shared knowledge in multilingual representations, we seek effective ways to integrate unseen languages into a pre-trained model. Nevertheless, the intricacy of representations shared among languages hinders their full utilisation. To resolve this problem, we employed an extra objective of target-language prediction, a central language-aware layer, and a monolingual adapter layer to improve representations when integrating LRLs. Focusing on improving LRLs in the linguistically diverse country of Indonesia, we evaluated five languages using a parallel corpus of 1,000 instances each. Experimental results measured by BLEU show a zero-shot improvement of 8.4 points at best, from a baseline score of 7.1 to 15.5. Further analysis showed that the performance gains are attributable more to the disentanglement of multilingual representations in the encoder, together with a shift toward target-language-specific representations in the decoder.
[Keywords]: Multilinguality, Machine Translation, Neural Language Representation Models, Less-Resourced/Endangered Languages.
language of the presentation: English
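One common realization of the "monolingual adapter layer" mentioned above is a residual bottleneck module attached to each frozen transformer layer, where only the adapter is trained for the new language. The sketch below shows such a module in PyTorch; the dimensions, activation, and placement are illustrative assumptions, not details from the presentation.

```python
# Sketch of a residual bottleneck adapter for integrating a low-resource
# language into a frozen pre-trained model. All sizes are assumptions.
import torch
import torch.nn as nn

class MonolingualAdapter(nn.Module):
    def __init__(self, d_model: int = 512, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)  # project down
        self.up = nn.Linear(d_bottleneck, d_model)    # project back up
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection: the pre-trained representation passes
        # through unchanged, plus a small learned language-specific shift.
        return hidden + self.up(self.act(self.down(hidden)))
```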
 
SEIVERIGHT CARGILL DUJOHN M, 2nd presentation, Natural Language Processing (渡辺 太郎, 荒牧 英治, 上垣外 英剛)
title: Can LLMs outperform rule-based dataset converters for taxonomic entities?
abstract: The advent of Large Language Models (LLMs) has brought a drastic shift in the paradigm of conventional NLP tasks. LLMs have been trained on a plethora of data, ranging from general knowledge corpora to domain-specific corpora, and have proven resourceful as substitutes for humans in laborious tasks such as entity labeling and information extraction. However, they are not without flaws. Hence, NLP practitioners are now trying to determine how to optimize the use of LLMs, for example by designing effective prompts, and to discover where they fall short and how to address those shortcomings. Our study leverages LLMs as a structured dataset converter, transforming an NER dataset from standoff format to CoNLL format. In addition, we evaluate the performance of a rule-based converter against our LLM workflow, and inject erroneous values to observe the resulting performance degradation.
language of the presentation: English
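To make the comparison in the abstract above concrete, the rule-based baseline amounts to aligning standoff character spans with tokens and emitting BIO tags. The sketch below assumes brat-style (start, end, type) spans and whitespace tokenization; the actual dataset's format and tokenizer are not specified in this announcement.

```python
# Sketch of a rule-based standoff-to-CoNLL converter: map character-offset
# entity spans onto whitespace tokens and emit BIO labels.
def standoff_to_conll(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    tagged, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        label = "O"
        for s, e, etype in spans:
            if start >= s and end <= e:  # token falls inside an entity span
                label = ("B-" if start == s else "I-") + etype
                break
        tagged.append((token, label))
    return tagged

# Hypothetical taxonomic example: one TAXON span over "Homo sapiens".
print(standoff_to_conll("Homo sapiens is a species", [(0, 12, "TAXON")]))
# [('Homo', 'B-TAXON'), ('sapiens', 'I-TAXON'), ('is', 'O'), ('a', 'O'), ('species', 'O')]
```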
 
JAN MEYER SARAGIH M, 1st presentation, Human-AI Interaction (Sakriani Sakti, 渡辺 太郎, 大内 啓樹, Faisal Mehmood)
title: Duration-Compliant Cross-Lingual Prosody Transfer in Automatic Dubbing
abstract: Automatic dubbing (AD) aims to automate dubbing, a post-production process in filmmaking where translated audio replaces the original recording to make content accessible to a wider audience. Important factors in dubbed video include isochrony (speech timing), which must comply with the original recording's duration, and prosody (pitch, length, timbre, etc.), which preserves speech naturalness. Recent AD research has explored the possibility of creating a machine translation model that predicts translated word durations to obey the duration constraint, and of transferring prosody information to the target language via a prosody embedding. However, these advances each focus on a single factor of dubbing rather than considering all the factors together. Therefore, in this research we plan to build a system that complies with both duration and prosody constraints. To achieve this, we plan to use the prosody embedding as an additional input embedding in machine translation (MT), so that prosody information can alter the duration prediction for translated speech. We also plan to add a duration prediction task to an end-to-end speech-to-text translation (S2TT) model, whose hidden representations have been reported to contain prosodic as well as linguistic information.
language of the presentation: English
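Since the plan above hinges on feeding prosody into the MT input, here is a minimal sketch of one way that could look: a prosody feature vector per token, projected into the model dimension and summed with the token embedding. The fusion by addition and all dimensions are assumptions for illustration, not the presenter's design.

```python
# Sketch: combining token embeddings with per-token prosody features so a
# translation model can condition duration prediction on prosody.
import torch
import torch.nn as nn

class ProsodyAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int = 32000, d_model: int = 512, d_prosody: int = 128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_prosody, d_model)  # map prosody into model space

    def forward(self, token_ids: torch.Tensor, prosody: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); prosody: (batch, seq, d_prosody),
        # e.g. pitch/energy/duration features extracted from source speech.
        return self.tok(token_ids) + self.proj(prosody)
```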