森本 湧基 | M, 1st presentation | Natural Language Processing | 渡辺 太郎, 荒牧 英治, 上垣外 英剛, 大内 啓樹 |
Title: Controllable Table Generation Abstract: Slot-Filling, a natural language processing task, has the problem that the schema must be predefined. Text-to-Table, another natural language processing task, solves this problem because it can be trained in a data-driven manner, but it introduces other problems common to generative models, such as hallucination and duplicated expressions. We will investigate whether these problems can be prevented by adding constraints to Text-to-Table that control the size and content of the output table. This presentation gives an overview of Slot-Filling, Text-to-Table, and constraints on its output, and presents the results of replicating the Text-to-Table model from a related study together with an analysis of the dataset used. Language of the presentation: Japanese | |||
HANG JIANGNAN | M, 1st presentation | Natural Language Processing | 渡辺 太郎, 荒牧 英治, 上垣外 英剛 |
Title: Calibration on Neural Machine Translation by explanation Abstract: Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is beneficial for correcting a model's biased output. However, miscalibration in NMT models has been widely observed and theoretically proven. Previous approaches indicate that multiple statistics and syntactic roles are related to miscalibration. In this research, we design a structure with auxiliary prediction to further exploit grammatical information. Using a carefully designed metric, the comparison between the original feature representation and the representation reconstructed from the target can reveal the effectiveness of calibration. We also develop an analysis to investigate the relationship between datasets from varying domains and linguistic features. Our research aims to contribute a novel framework for the calibration task within the field of Neural Machine Translation (NMT). Language of the presentation: English | |||
FREDERIKUS HUDI | M, 1st presentation | Natural Language Processing | 渡辺 太郎, 荒牧 英治, 上垣外 英剛 |
Title:
Leveraging Underrepresented Languages in Multilingual Neural Machine Translation with Pre-trained Language Models
Abstract: Existing Natural Language Processing (NLP) research has mainly focused on high-resource languages, leaving their lower-resource counterparts behind. This leads to under-representation of Low Resource Languages (LRL) in multilingual datasets and multilingual models, resulting in low-quality performance on those languages. Recent approaches have tried to alleviate the problem by pre-training the model on similar higher-resource languages; however, this suffers from bias towards the higher-resource languages. In the Multilingual Neural Machine Translation (MNMT) task, this results in low-quality translation in the zero-shot setting and unintended translation into the wrong language. We explore possible solutions to these issues and the challenges they pose by examining the LRL case of Indonesia, a country with 700+ indigenous languages. Language of the presentation: English | |||
ALI IQRA | M, 1st presentation | Natural Language Processing | 渡辺 太郎, 荒牧 英治, 上垣外 英剛 |
Title: Corpus Collection of Low Resource Languages for Paraphrase Detection
Abstract: In this work, we present a high-quality sentence-level corpus for the Pashto language, built from real cases manually collected from journalism websites and annotated. The corpus will cover 8 domains: Sports, Health and fitness, Economy, Science and technology, Art and literature, Entertainment, Politics, and Natural Disasters. The target size is 5K instances manually collected from international and local Pashto newspaper websites. This resource will be the first of its kind developed for the Pashto language, and we believe it will be a valuable contribution to the evaluation of paraphrase detection systems. Language of the presentation: English | |||
SEIVERIGHT CARGILL DUJOHN | M, 1st presentation | Natural Language Processing | 渡辺 太郎, 荒牧 英治, 進藤 裕之 |
Title: Zero-shot IE from GPT4 Style Prompting
Abstract: Prompting of LLMs (Large Language Models) has gained traction because it allows knowledge to be distilled from high-resource to low-resource domains. Following this recent trend, we aim to investigate the under-studied domain of biodiversity in natural language text. Specifically, we leverage QA (Question Answering) style prompts with GPT4 and curate new data to support both RE (Relation Extraction) and NER (Named Entity Recognition) of endangered species. We hope our efforts can go a long way toward protecting endangered species and supporting biodiversity. Language of the Presentation: English | |||
EUNIKE ANDRIANI KARDINATA | M, 1st presentation | Natural Language Processing | 渡辺 太郎, 荒牧 英治, 大内 啓樹 |
Title: Constructing Indonesian Travelogue Dataset
Abstract: Indonesian is a low-resource but growing language. We observe an increasing number of speakers as well as a rise in efforts to research and develop the language. One possible research area to focus on is Geographic Information Science (GISc), where we apply knowledge from Natural Language Processing (NLP) research to process geographic data contained in texts. GISc is relevant and currently needed because we can no longer rely on the typical form and processing of geographic data. More often, valuable geographic information is actually contained in texts written by ordinary people. Furthermore, as tourism has recently recovered, we expect more travel-related data to be generated. Hence, by constructing an Indonesian travelogue dataset, we aim to conduct and promote more research on the Indonesian language, specifically in the context of travel. We hope that this dataset can be used to provide further insights into the performance of existing technology and to help tackle problems specific to Indonesian-language data. Language of the Presentation: English | |||