| ZHONG JIAJUN | M, 2nd presentation | Computing Architecture | 中島 康彦, | 林 優一, | 張 任遠, | KAN Yirong, | PHAM HOAI LUAN, | Le Vu Trung Duong |
|
title: Spikceiver: Compact Audio-Visual Spiking Transformer with Depthwise Splitting and Bottleneck Attention
abstract: Audio-visual classification on edge and neuromorphic platforms requires high recognition accuracy under tight power constraints. Existing attention-based spiking Transformers still suffer from quadratic attention complexity with respect to audio-visual token length, making deployment on resource-constrained systems difficult. This research proposes Spikceiver, an audio-visual spiking Transformer that targets efficient multimodal fusion with a compact model size. Spikceiver integrates spiking bottleneck fusion, using a small set of learnable Spiking Bottleneck Tokens that attend to each modality to produce compact latent representations for fusion, and an efficiency-oriented Spiking Depthwise-Separable Splitting front-end that replaces vanilla convolution for tokenization. On UrbanSound8K-AV and enhanced CIFAR10-AV (CIFAR10-AV(e)), Spikceiver achieves 95.05% and 89.78% top-1 accuracy, respectively, with 4.84M MACs and 1.271M and 1.273M parameters, yielding Efficiency-Accuracy Scores (EAS) of 0.963 and 0.895. Compared with the Multimodal Bottleneck Transformer (MBT-4-256), Spikceiver improves accuracy (95.05% vs. 84.01%, and 89.78% vs. 83.46%) while reducing MACs (4.84M vs. 7.26M) and parameters (about 1.27M vs. 6.40M). Compared with the Spiking Multimodal Transformer (SMMT-1-256), Spikceiver reduces MACs by more than 800x with fewer parameters, demonstrating a favorable efficiency-accuracy trade-off for audio-visual spiking inference on resource-constrained platforms. language of the presentation: English | ||||||||
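As background on the depthwise-separable front-end, the MAC savings over a vanilla convolution follow from the standard cost formulas. The sketch below uses illustrative layer sizes, not Spikceiver's actual configuration:

```python
# MAC counts for a standard vs. a depthwise-separable convolution layer.
# Formulas: standard  = K*K*Cin*Cout*H*W
#           separable = K*K*Cin*H*W (depthwise) + Cin*Cout*H*W (pointwise)
def standard_macs(k, c_in, c_out, h, w):
    return k * k * c_in * c_out * h * w

def separable_macs(k, c_in, c_out, h, w):
    return k * k * c_in * h * w + c_in * c_out * h * w

# Hypothetical tokenization layer: 3x3 kernel, 64 -> 128 channels, 32x32 map.
std = standard_macs(3, 64, 128, 32, 32)
sep = separable_macs(3, 64, 128, 32, 32)
print(std, sep, round(std / sep, 2))  # separable needs ~8.4x fewer MACs here
```

The saving grows with the output channel count, which is why separable tokenization front-ends shrink MAC budgets on multimodal inputs.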
| 安藤 拓翔 | M, 1st presentation | Computing Architecture | 中島 康彦, | 林 優一, | 張 任遠, | KAN Yirong, | PHAM HOAI LUAN, | Le Vu Trung Duong |
|
title: Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA
abstract: The rise of generative AI for tasks like Automatic Speech Recognition (ASR) has created a critical energy consumption challenge. While ASICs offer high efficiency, they lack the programmability to adapt to evolving algorithms. To address this trade-off, we implement and evaluate Whisper's core computational kernel on IMAX, a general-purpose Coarse-Grained Linear Array (CGLA) accelerator. To our knowledge, this is the first work to execute a Whisper kernel on a CGRA and compare its performance against CPUs and GPUs. Using hardware/software co-design, we evaluate our system via an FPGA prototype and project performance for a 28 nm ASIC. Our results demonstrate superior energy efficiency: the projected ASIC is 1.90x more energy-efficient than the NVIDIA Jetson AGX Orin and 9.83x more than an NVIDIA RTX 4090 for the Q8_0 model. This work positions the CGLA as a promising platform for sustainable ASR on power-constrained edge devices. language of the presentation: English | ||||||||
| 岡本 光 | M, 1st presentation | Computing Architecture | 中島 康彦, | 林 優一, | 張 任遠, | KAN Yirong, | PHAM HOAI LUAN, | Le Vu Trung Duong |
|
title: Efficient Zero-Knowledge Proof Accelerator: Hardware and Software for Multi-Scalar Multiplication
abstract: Zero-Knowledge Proof (ZKP) is a privacy-preserving protocol that allows a prover to demonstrate the validity of a statement without revealing its details. A widely used ZKP primitive, the Zero-Knowledge Succinct Non-Interactive Argument of Knowledge (zk-SNARK), has attracted significant attention in edge computing; however, edge devices face severe resource constraints when processing its computational bottleneck, multi-scalar multiplication (MSM). To address this issue, co-designing hardware and software for MSM is necessary to achieve both affordable speed and superior energy/resource efficiency. This presentation describes the core idea of the hardware-software integration. Additionally, it presents the standalone hardware performance, which achieves up to 52.9x power efficiency compared with modern CPUs and 9.5x resource efficiency compared with prior works. It also covers a further throughput improvement obtained by rescheduling the computation logic. language of the presentation: English | ||||||||
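As background, MSM computes the sum of k_i * P_i over many scalar/point pairs, and the bucket (Pippenger-style) method is the standard way to reduce the number of group operations. The sketch below is illustrative only: it models group elements as plain integers under addition, whereas a real MSM accelerator performs elliptic-curve point additions:

```python
# Bucket-method (Pippenger-style) multi-scalar multiplication sketch.
# Group elements are modeled as plain integers under addition, so
# "scalar * point" is ordinary multiplication; a ZKP accelerator would
# replace each "+" with an elliptic-curve point addition.
def msm_bucket(scalars, points, window=4):
    n_bits = max(s.bit_length() for s in scalars)
    n_windows = (n_bits + window - 1) // window
    result = 0
    for w in reversed(range(n_windows)):
        result *= 1 << window  # "window" doublings of the running result
        buckets = [0] * (1 << window)
        for s, p in zip(scalars, points):
            digit = (s >> (w * window)) & ((1 << window) - 1)
            buckets[digit] += p  # one group addition per point
        # Running-sum trick: sum_j j*buckets[j] in ~2*2^window additions.
        running, acc = 0, 0
        for b in reversed(buckets[1:]):
            running += b
            acc += running
        result += acc
    return result

scalars = [23, 7, 190, 51]
points = [3, 11, 5, 8]
print(msm_bucket(scalars, points))  # equals sum(s*p for s, p in pairs) = 1504
```

The window width trades bucket count against the number of windows, which is exactly the kind of scheduling knob a hardware implementation can tune.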
| ROCHA CONFESSOR BRIAN | D, interim presentation | Cybernetics and Reality Engineering | 清川 清, | 荒牧 英治, | 内山 英昭, | Perusquia Hernandez Monica, | 平尾 悠太朗 |
|
title: Hiki-game-ori: A Proposed Framework for Therapeutic Game Development for Socially Isolated Audiences
abstract: Severe social withdrawal and marginalization, exemplified by hikikomori in Japan, constitute a growing global health challenge, often leading to significant social impairment and adverse long-term health effects. While video games are sometimes viewed as contributing to isolation, some researchers believe they can also offer a unique, low-risk medium for intervention and resocialization. However, to properly develop game-based interventions, it is important to understand which game elements should and should not be present in the developed games, and whether socially isolated audiences resonate with the developed product. To address this gap, this work aims to establish the "Hiki-game-ori" framework, a structured approach for developing therapeutic games specifically designed to reach and reintegrate socially isolated audiences. This is achieved by characterizing the gamer personality traits and psychological dispositions of players at high risk of social marginalization. From these results, prototype games are developed based on the identified game preferences and back-tested with socially isolated players, in order to understand how to better create tailored digital environments for such audiences. language of the presentation: English | |||||||
| MUSHAFFA RASYID RIDHA | D, interim presentation | Human-AI Interaction | Sakriani Sakti, | 荒牧 英治, | 大内 啓樹, | Faisal Mehmood, | Bagus Tris Atmaja |
|
title: Toward Visual Pronunciation Learning: Refining rtMRI Data and Pipeline for Speech-to-Articulatory Animation
abstract: Most Computer-Assisted Pronunciation Training (CAPT) systems focus on scoring audio and often lack the visual feedback needed to show learners how to correct their articulation. While real-time Magnetic Resonance Imaging (rtMRI) offers high-resolution imaging of the vocal tract, from which visual feedback could be generated, it has not yet been applied to CAPT research. First, we address a problem in the widely used USC-TIMIT dataset, where automatic labeling often produces inaccurate vocal tract contours. We introduce a data refinement method that combines Fully Convolutional Network (FCN)-based smoothing with a landmark point-to-edge curve projection technique. By removing outliers, training a model to smooth the errors, and then projecting landmarks onto detected edges, we generate a better "ground truth" than the landmark points originally provided with the dataset. Building on these refined labels, we present a novel speech-to-articulatory animation pipeline. We leverage a self-supervised learning (SSL) model, wav2vec 2.0, whose embeddings are known to capture robust speech representations, and fine-tune it to predict vocal tract movements directly from audio. Unlike traditional approaches relying on MFCCs, our system (wav2vec 2.0 with LSTM layers) reconstructs more precise landmark contour coordinates. The result is a system capable of generating vocal tract landmarks directly from speech input. language of the presentation: English | |||||||
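As background on the point-to-edge projection step, snapping a landmark onto a detected edge can be modeled as projecting the point onto its nearest position along a polyline. The geometry below is a minimal sketch under that assumption, not the authors' exact procedure; the edge and landmark values are made up:

```python
# Project a landmark point onto the nearest point of a polyline ("edge").
# In the actual pipeline the polyline would come from edge detection on
# an rtMRI frame; here it is hard-coded for illustration.
def project_to_segment(p, a, b):
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return a  # degenerate segment: both endpoints coincide
    # Clamp the projection parameter t so the result stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return (ax + t * dx, ay + t * dy)

def project_to_polyline(p, polyline):
    candidates = [project_to_segment(p, a, b)
                  for a, b in zip(polyline, polyline[1:])]
    return min(candidates,
               key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

edge = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]  # hypothetical detected edge
landmark = (0.4, 0.3)                        # noisy landmark off the edge
print(project_to_polyline(landmark, edge))   # snaps to (0.4, 0.0)
```

Applying this per landmark after outlier removal and smoothing yields coordinates that lie exactly on the detected contour.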
| TAMAYO LENARD PAULO VELASCO | D, interim presentation | Social Computing | 荒牧 英治, | Sakriani Sakti, | 若宮 翔子, | PENG SHAOWEN | |
|
title: Towards a Multilingual IoT-LLM enabled System for Clinical Consultation
abstract: Communication gaps in clinical settings can significantly impact patient understanding, trust, and informed medical decision-making, particularly in multilingual healthcare environments. Although Internet of Things (IoT) technologies and large language models (LLMs) demonstrate strong potential to support clinical consultations, concerns regarding safety, robustness, and real-world deployment remain insufficiently addressed. To bridge this gap, this research aims to design and develop a deployable multilingual IoT-LLM-enabled clinical consultation system. First, chatbot responses are evaluated across medical, ethical, and legal risk domains using prompt engineering techniques to establish a safety foundation. Second, a multi-assessment and multi-professional agent (MA-MPA) framework is introduced to enhance the robustness and reliability of risk estimation. The RAG-based MA-MPA system achieved a macro F1-score of 0.800 and a joint accuracy of 60.3%, outperforming baseline models and substantially improving cross-domain risk assessment, particularly in the ethical and legal domains. Finally, a deployable multilingual IoT-LLM-enabled consultation system is developed, integrating real-time speech recognition, structured LLM reasoning, and a safety validation layer. The system supports multilingual explanations, clinical workflow integration, and human-centered evaluation through comprehension, usability, and attention-based analyses. Overall, this work contributes a structured validation framework and a deployable architecture to advance safe, trustworthy, and patient-centered AI in clinical settings. language of the presentation: English | |||||||
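As background on the macro F1-score metric reported above: it is the unweighted mean of per-class F1 scores, so rare risk classes weigh as much as common ones. The sketch below uses made-up labels, not the study's data:

```python
# Macro F1: unweighted mean of per-class F1, computed from scratch.
# The label lists below are fabricated for illustration only.
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never hit.
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1_scores) / len(f1_scores)

y_true = ["low", "low", "high", "high", "high", "mid"]
y_pred = ["low", "high", "high", "high", "mid", "mid"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.667
```

Averaging per class rather than per sample is what makes the metric sensitive to performance on the sparser ethical and legal risk domains.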