Speech-to-speech translation (S2S) technology breaks down language barriers and allows people who speak different languages to communicate easily and transparently. However, conventional S2S systems ignore paralinguistic information such as emotion and emphasis. Regardless of how emotional or emphatic the source speech is, the output speech is always expressed in the same way.
This thesis attempts to solve the problem of translating emphasis information in S2S systems by introducing two new components: emphasis estimation and emphasis translation. Emphasis estimation assigns a real-valued score, called word-level emphasis, to every word in an utterance, so each utterance has a corresponding emphasis sequence. Emphasis translation then translates the estimated emphasis sequence into a target-language emphasis sequence. As a result, the system can convey emphasis information through the translation process. Moreover, this approach does not require modifying the conventional S2S system, so it can easily be integrated into any existing S2S system.
Regarding the experiments, we first analyze an emphasis corpus to examine how people emphasize words within individual languages and across languages. We found that emphasized words have greater power and duration than normal words, and that Japanese speakers use less duration than English speakers owing to characteristics of the Japanese language. We also observed a relatively high correlation coefficient, which indicates that the translation can be performed with a simple linear regression mapping function. Building on these findings, we conduct emphasis translation experiments using conditional random fields. The results indicate that our translation model translates emphasis information accurately while preserving the naturalness of the audio.
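The simple linear-regression mapping suggested by the correlation analysis can be sketched as below. The paired emphasis scores are fabricated for illustration; the thesis's actual corpus values and fitted coefficients are not reproduced here.

```python
# A minimal sketch of a linear-regression mapping from source-language
# to target-language word-level emphasis scores (data points are made up).

def fit_linear(xs, ys):
    """Closed-form least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Illustrative paired word-level emphasis scores (source, target).
src = [0.1, 0.2, 0.8, 0.9, 0.5]
tgt = [0.15, 0.25, 0.7, 0.85, 0.5]

a, b = fit_linear(src, tgt)

def predict(x):
    """Map a source emphasis score to a target emphasis score."""
    return a * x + b

print(round(predict(0.8), 3))
```

When source and target scores correlate strongly, such a one-parameter-per-dimension mapping already transfers most of the emphasis signal, which motivates trying it before heavier sequence models such as conditional random fields.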