Colloquium B Presentations

Date and time: September 17 (Fri), 2nd period (11:00-12:30)


Venue: L1

Chair: 澤邊 太志
佐賀 健志 M, 2nd presentation, Augmented Human Communication: 中村 哲, 渡辺 太郎, 作村 諭一, 田中 宏季
Title: Multimodal Dataset of Social Skills Training in a Natural Conversational Setting
Abstract: Social Skills Training (SST) is commonly used in psychiatric rehabilitation programs to improve social skills. It is especially effective for people who have social difficulties related to mental illnesses or developmental disorders. Previous studies have revealed several communication characteristics of Schizophrenia and Autism Spectrum Disorder. However, few studies have been conducted in natural conversational environments with computational features, since automatic capture and analysis are difficult in natural settings. Although natural data collection is difficult, such data clearly have much greater potential to reveal the real communication characteristics of people with mental difficulties and the interaction differences between participants and trainers. We therefore collected a one-on-one SST multimodal dataset to investigate and automatically capture the natural characteristics expressed by people who suffer from mental difficulties such as Schizophrenia or Autism Spectrum Disorder. To validate the potential of the dataset, we used partially annotated data to train a classifier that distinguishes Schizophrenia from healthy controls with audio-visual features. We achieved over 85% accuracy, precision, recall, and F1-score in this classification task using only natural interaction data, rather than data captured in tasks specifically designed for clinical assessments.
Language of the presentation: English
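A minimal sketch of the validation step described above, assuming pre-extracted per-session audio-visual feature vectors; the support-vector classifier, data shapes, and train/test split are illustrative assumptions, not the authors' actual pipeline:

```python
# Minimal sketch: binary classification (Schizophrenia vs. healthy control)
# from pre-extracted audio-visual features. The SVM choice and the split
# are illustrative assumptions, not the paper's pipeline.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# X: (n_sessions, n_features) audio-visual features per session (placeholder)
# y: 1 = Schizophrenia, 0 = healthy control
X, y = np.random.rand(100, 64), np.random.randint(0, 2, 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)

# Reports the accuracy, precision, recall, and F1 metrics cited in the abstract.
print(classification_report(y_te, clf.predict(X_te)))
```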
 
徳山 太顕 M, 2nd presentation, Augmented Human Communication: 中村 哲, 渡辺 太郎, 作村 諭一, Sakriani Sakti, 須藤 克仁
Title: Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-based Speech-to-Text Translation
Abstract: In spoken communication, a speaker may convey their message in words (linguistic cues) along with supplemental information (paralinguistic cues) such as emotion and emphasis. Transforming all spoken information into a written or verbal form is not trivial, especially if the transformation has to be done across languages. Most existing speech-to-text translation systems focus only on translating linguistic information while ignoring paralinguistic information. A few recent studies proposed paralinguistic translation using machine translation with hidden Markov model (HMM)-based automatic speech recognition (ASR) and text-to-speech (TTS), but these pipelines were complicated and suboptimal, and the paralinguistic information was kept in acoustic form. Here, we focus on transcribing the paralinguistic acoustic cues of emphasis into target language text. Specifically, we constructed cascade and direct neural Transformer-based speech-to-text translation systems and investigated various methods of expressing emphasis information in the written form of the target language. We performed our experiments on a Japanese-to-English linguistic and paralinguistic speech-to-text translation framework. The results reveal that our proposed method can translate both linguistic and paralinguistic information while maintaining performance comparable to standard linguistic translation.
Language of the presentation: Japanese
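One plausible way to express emphasis in the written form of the target language is to wrap emphasized words in inline tags, so that a standard text decoder can emit paralinguistic information as ordinary tokens. The tag scheme and threshold below are hypothetical illustrations; the presentation compares several encodings:

```python
# Hypothetical illustration: encode word-level emphasis weights into the
# target text as inline tags. The <emph>...</emph> scheme and the 0.5
# threshold are assumptions, not the encoding actually used in the work.
def tag_emphasis(words, emphasis_weights, threshold=0.5):
    tagged = []
    for word, weight in zip(words, emphasis_weights):
        if weight >= threshold:
            tagged.append(f"<emph> {word} </emph>")
        else:
            tagged.append(word)
    return " ".join(tagged)

print(tag_emphasis(["I", "did", "NOT", "say", "that"],
                   [0.1, 0.2, 0.9, 0.1, 0.3]))
# -> "I did <emph> NOT </emph> say that"
```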
 
有岡 無敵 M, 2nd presentation, Augmented Human Communication: 中村 哲, 渡辺 太郎, 須藤 克仁, Sakriani Sakti, 吉野 幸一郎 (Visiting Associate Professor)
Title: Persuasive Dialogue System Based on Multimodal Information
Abstract: A persuasive dialogue system is a dialogue system that interacts with the user to change the user's behavior so as to achieve the system's goals. In persuasive dialogue, the system is expected to gain an advantage by exploiting multimodal information. Using a camera and a microphone, the system can continuously acquire such information and judge things that are difficult for human eyes and ears to capture, such as whether a person is lying or what emotions the person is expressing. To make good use of multimodal information in persuasive dialogue, it is essential to analyze the persuasion process. In this study, we created a multimodal persuasive dialogue dataset using the dialogue android ERICA and analyzed which features are effective for persuasion. In future work, we will apply this knowledge to complete a persuasive dialogue system that utilizes multimodal information.
Language of the presentation: Japanese
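A minimal sketch of how one might test which multimodal features are effective for persuasion; the logistic-regression approach and the feature names are illustrative assumptions, not the analysis actually reported in the study:

```python
# Minimal sketch: relating multimodal features to persuasion outcomes.
# The model choice and feature names are illustrative assumptions,
# not the study's actual analysis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["gaze_on_robot", "smile_ratio", "speech_rate", "pitch_var"]
X = np.random.rand(80, len(feature_names))  # placeholder per-dialogue features
y = np.random.randint(0, 2, 80)             # 1 = user was persuaded

model = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")  # sign/magnitude hint at each feature's effect
```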
 
胡 尤佳 M, 2nd presentation, Augmented Human Communication: 中村 哲, 渡辺 太郎, 須藤 克仁, Sakriani Sakti
Title: ASR Posterior-based Loss for Multi-task End-to-End Speech Translation
Abstract: End-to-end speech translation (ST) translates source-language speech directly into the target language without the intermediate automatic speech recognition (ASR) output used in a cascading approach. End-to-end ST has the advantage of avoiding error propagation from intermediate ASR results, but its performance still lags behind the cascading approach. A recent effort to close this gap is multi-task learning with an auxiliary ASR task. However, previous multi-task learning for end-to-end ST uses cross-entropy (CE) loss on one-hot references in the ASR task and does not consider ASR confusion. In this study, we propose a novel end-to-end ST training method that computes the ASR loss against ASR posterior distributions given by a pre-trained model, which we call ASR posterior-based loss. The proposed method is expected to account for possible ASR confusion among competing hypotheses with similar pronunciations. In our Fisher Spanish-to-English translation experiments, the proposed method achieved better BLEU scores than a baseline trained with standard CE loss with label smoothing.
Language of the presentation: English
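A minimal sketch of the idea behind a posterior-based ASR loss: instead of cross entropy against one-hot references, match the ST model's ASR-head distribution to the soft posteriors of a pre-trained ASR model. Using KL divergence specifically, and the interpolation weight, are assumptions about the implementation:

```python
# Minimal sketch: match the ST model's ASR-head distribution to soft
# posteriors from a pre-trained ASR model via KL divergence, instead of
# CE against one-hot references. The exact loss definition is an
# assumption, not the paper's code.
import torch
import torch.nn.functional as F

def asr_posterior_loss(asr_head_logits, teacher_posteriors):
    # asr_head_logits:    (batch, time, vocab) from the ST model's ASR task
    # teacher_posteriors: (batch, time, vocab) probabilities from pre-trained ASR
    log_probs = F.log_softmax(asr_head_logits, dim=-1)
    return F.kl_div(log_probs, teacher_posteriors, reduction="batchmean")

# Multi-task objective: interpolate with the main translation loss
# (st_loss and the 0.3 weight are placeholders).
# total_loss = st_loss + 0.3 * asr_posterior_loss(logits, posteriors)
```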
 

Venue: L2

Chair: 矢田 竣太郎
LI ZHONGLUO D, interim presentation, Mathematical Informatics: 池田 和司, 向川 康博, 吉本 潤一郎, 福嶋 誠, 日永田 智絵
 
PINEDA RIZA RAE D, interim presentation, Mathematical Informatics: 池田 和司, 向川 康博, 吉本 潤一郎, 久保 孝富, 福嶋 誠, 日永田 智絵
Title: Social Behavior Analysis of Wild Primates using Deep Neural Networks
Abstract: Behavioral studies involve observing interactions between organisms in their environment. Such observations in animals, supported by evolutionary evidence, have led to a deeper understanding of human evolutionary patterns. Primates, the closest genetic relatives of humans, are the subject of many studies aiming to map developmental and mutative patterns in order to understand human behavior and instincts. Despite the growth of ethology, biologists are still limited to manual or semi-automated analysis due to hard limits on data collection and pre-processing. These issues are heightened when the subjects are studied in their natural habitat rather than in a confined laboratory setup. Capturing data in the wild is not only logistically difficult but also complex, since the capture parameters depend strongly on the state of the natural environment. Video recording is currently the least invasive method that lets biologists keep a permanent visual record for future analysis without affecting the subjects' behavior. However, thorough summarization of visual data through frame-by-frame review is time-consuming, especially for large datasets. Machine learning techniques have been used to offload the laborious task of visual data pre-processing; even so, object tracking and occlusion handling remain open issues. With the overall goal of developing a robust framework for monkey social behavior analysis, our study focuses on feature extraction and activity classification and explores the automated construction of social networks of monkey troops to capture the dynamics of social structures in such groups. We aim to use this network to further analyze how behaviors and diseases are transmitted within a social group. It may also support the study of reproductive behavior (selection or rejection of mates) in monkeys, which can aid conservation planning. In this preliminary work, we present our scene-agnostic tracking method, which uses a cascade of two deep neural networks on videos captured in the wild, and show how we used this system to generate social networks for the monkey population in the study.
Language of the presentation: English
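A minimal sketch of the social-network construction step, assuming the tracking cascade outputs per-frame identities and positions; the proximity threshold and co-occurrence edge weights are illustrative assumptions, not the study's actual procedure:

```python
# Minimal sketch: build a social network from tracking output, linking
# monkeys observed near each other. Per-frame (id -> (x, y)) tracks, the
# proximity threshold, and co-occurrence edge weights are assumptions.
from itertools import combinations
import math
import networkx as nx

# tracks[frame] = {monkey_id: (x, y)} as produced by the tracking cascade
tracks = {
    0: {"A": (10, 12), "B": (14, 15), "C": (90, 80)},
    1: {"A": (11, 13), "B": (13, 14), "C": (88, 79)},
}

G = nx.Graph()
for positions in tracks.values():
    for (id1, p1), (id2, p2) in combinations(positions.items(), 2):
        if math.dist(p1, p2) < 20:  # proximity threshold (assumed)
            w = G.get_edge_data(id1, id2, {"weight": 0})["weight"]
            G.add_edge(id1, id2, weight=w + 1)  # count co-occurrences

print(G.edges(data=True))  # -> [('A', 'B', {'weight': 2})]
```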
 
村重 哲史 D, interim presentation, Mathematical Informatics: 池田 和司, 向川 康博, 吉本 潤一郎, 久保 孝富, 福嶋 誠, 日永田 智絵