Seminar Presentations

Date and time: Monday, May 30, 3rd period (13:30-15:00)

Venue: L1

Chair: 大平雅雄

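赤峯享	D	松本裕治, 松本健一, 新保仁, 小町守
Title: Staged Selection of Target Pages for Large-Scale Web Information Analysis
Abstract: To support decision-making such as product purchases, there is demand for information analysis systems that extract opinions and similar content from Web pages on the Internet and present users with an overall picture of those opinions, or with distinctive opinions and pages. Because information analysis such as opinion analysis is computationally expensive, it has to be run on a selected subset of the relevant pages. The Web mixes many kinds of documents, and a large number of pages are unsuitable as analysis targets. Efficiently selecting a suitable set of pages is therefore an important problem in information analysis. In this presentation, we propose a method that collects Web pages on the scale of one billion pages and selects from them, in a query-independent manner, a set on the scale of 100 million pages to be analyzed. The method has the following features. (1) Web pages are selected by filtering out ineligible pages, such as product catalog pages and copied pages, and by weighted sampling biased by PageRank, the words appearing in a page, and similar signals. (2) Selection is performed in multiple stages, and the results of computationally expensive later-stage processing are fed back to the earlier stages.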
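As a rough illustration of feature (1), the following Python sketch filters out ineligible pages and then draws a weighted sample without replacement. It is not the presenter's implementation: the page fields (`is_catalog`, `is_copy`, `pagerank`, `keyword_hits`), the weight formula, and the function name `select_pages` are all hypothetical, and the sampling step uses the Efraimidis-Spirakis key method as one common way to do weighted sampling.

```python
import random


def select_pages(pages, k, seed=0):
    """Filter ineligible pages, then draw up to k pages by weighted sampling.

    `pages` is a list of dicts with hypothetical fields:
    url, pagerank, keyword_hits, is_catalog, is_copy.
    """
    rng = random.Random(seed)

    # Step 1: drop ineligible pages (catalog pages and copied pages).
    eligible = [p for p in pages if not (p["is_catalog"] or p["is_copy"])]

    # Step 2: weighted sampling without replacement.
    # Each page gets the key u ** (1 / w) with u uniform in [0, 1);
    # the k largest keys form the sample (Efraimidis-Spirakis).
    keyed = []
    for p in eligible:
        # Assumed bias: PageRank scaled by the number of keyword hits.
        weight = p["pagerank"] * (1 + p["keyword_hits"])
        if weight <= 0:
            continue
        keyed.append((rng.random() ** (1.0 / weight), p["url"]))

    keyed.sort(reverse=True)
    return [url for _, url in keyed[:k]]


pages = [
    {"url": "a", "pagerank": 0.9, "keyword_hits": 3, "is_catalog": False, "is_copy": False},
    {"url": "b", "pagerank": 0.2, "keyword_hits": 0, "is_catalog": False, "is_copy": False},
    {"url": "c", "pagerank": 0.8, "keyword_hits": 5, "is_catalog": True, "is_copy": False},
    {"url": "d", "pagerank": 0.5, "keyword_hits": 1, "is_catalog": False, "is_copy": True},
    {"url": "e", "pagerank": 0.4, "keyword_hits": 2, "is_catalog": False, "is_copy": False},
]
sample = select_pages(pages, k=2, seed=1)
```

Pages with larger weights are more likely to be picked, but low-weight pages still have a chance, which is what separates biased sampling from a plain top-k cutoff. The feedback from later stages described in feature (2) is not modeled here.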

HABIB ASAD	D	松本裕治, 松本健一, 新保仁, 浅原正幸
Title: Urdu Keypads and Keyboards for Touch-Screen Devices
Abstract: NLP has numerous applications at the character level, including romanization, transliteration, script generation, and input system and interface design. Urdu is the second-largest Arabic-script language by number of speakers, yet its small presence on the Internet does not reflect that rank. Major causes include limited platform support and poor interface designs for composing text in Urdu. Increasingly, data is generated and uploaded from touch-screen devices of various shapes and screen sizes, such as tablet PCs and mobile phones. Different interfaces suit different devices, depending on the natural language(s) in which users need to input data. Full keyboard replicas with base and shift layouts suffer from usability and visibility problems and are therefore not viable for small touch-screen systems. Designing optimized Urdu keypads for small screens is an intricate task because of the relatively large alphabet. We approach this problem using unigram and bigram frequencies from a large corpus. The experimental results show substantial significance. Apart from visibility and usability issues, small-screen devices pose health hazards to the user. Keeping health and hygiene in focus, we designed touch-screen keypads intended for fast, correct, and easy Urdu composition. Our optimization technique for arranging the letters and our interface for data input are extensible and equally applicable to other natural languages.
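The unigram and bigram frequencies mentioned in the abstract could be gathered along the following lines. This is a minimal sketch only: the function name `char_ngram_counts` and the whitespace tokenization are assumptions, and the abstract does not say how the counts feed into the keypad layout, so that step is left out.

```python
from collections import Counter


def char_ngram_counts(corpus_lines):
    """Count character unigrams and within-word character bigrams."""
    unigrams = Counter()
    bigrams = Counter()
    for line in corpus_lines:
        # Assumption: words are separated by whitespace.
        for word in line.split():
            unigrams.update(word)
            # Adjacent character pairs inside the same word.
            bigrams.update(zip(word, word[1:]))
    return unigrams, bigrams


# Tiny stand-in corpus; a real run would read a large Urdu corpus.
unigrams, bigrams = char_ngram_counts(["abab ba"])
```

A layout optimizer could then, for example, give frequent letters the most reachable keys and use the bigram counts to judge how often two keys are pressed in sequence.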

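磯谷亮輔	D	中村哲, 松本裕治, 戸田智基
Title: Conducting a Nationwide Speech Translation Field Experiment and Adapting Speech Recognition Models Using Log Data
Abstract: Speech translation, which supports communication across languages, is approaching practical performance in domain-limited settings thanks to advances in model training methods based on large-scale spoken language corpora. In real-world field use, however, performance degrades considerably compared with laboratory evaluation. Adapting the models with usage log data collected in the field is considered a promising way to address this problem. We therefore conducted a large-scale speech translation field experiment in five regions across Japan and used about 50,000 utterances of collected log data to evaluate model adaptation for speech recognition and verify its effectiveness. Through further detailed comparative evaluation, we clarified the respective effects of acoustic model adaptation and language model adaptation, the relationship between the amount of adaptation data and performance, the regional dependence of the adaptation effect and of the data, and the feasibility of unsupervised adaptation, which requires no manual transcription.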