Audio quality evaluations of synthesized speech are important, since the perceived audio quality of the synthesized speech may determine a system's market success. Such evaluations are usually conducted either subjectively or objectively. Subjective approaches use real human evaluators but give no insight into the evaluators' cognitive state, so we do not know how the stimuli affected the result. Objective approaches replace human evaluators with computer programs that calculate audio features; however, the relationship between the calculated speech features and the perceived quality is not yet well understood. Recently, studies have started to utilize physiological readings from the evaluators for quality assessment. Physiological reactions have the advantage that they are difficult to conceal and give insight into the evaluators' cognitive state, and could therefore provide better insight into the perceived quality of the stimuli.
Previous studies have proposed the use of physiological signals for synthesized speech quality evaluation. These studies (1) performed regression of individual opinion scores using only brain activity features and (2) combined audio features with brain activity features, but only to examine how those features correlated with individual opinion scores. Both works share an issue: neither used a neural-network-based model, so careful feature engineering was necessary. The work that combined audio features with brain activity did not perform regression to predict unseen data, but only fitted a multilinear regression model. Lastly, since brain activity is considered unique to each person, it is difficult to generalize; both previous works therefore depend on the subjects used to train the model.
None of the previous works, however, investigated how brain activity reacts to differences in synthesized speech audio quality. Hence, at the beginning of this thesis, we investigated brain activity under stimulation with natural speech and several different types of synthesized speech. The goal was to see where, when, and how the brain reacted to differences in synthesized speech audio quality by controlling which audio feature was synthesized, using natural speech as the baseline. We separated the brain waves into five frequency bands, used the generalized Fisher score to investigate the differences, and performed Support Vector Machine regression on a per-band basis. The results showed that the difference in brain activity appeared clearly in the alpha and beta frequency bands, with the beta band yielding the lowest error in the regression task.
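To make the per-band analysis concrete, the sketch below band-pass filters EEG trials into five conventional frequency bands and fits a Support Vector Machine regressor on each band separately. This is only a minimal illustration: the band boundaries, the assumed 250 Hz sampling rate, the use of per-channel band power as features, and the helper names (band_power, per_band_regression) are assumptions for the example, not the exact pipeline used in this work.

```python
# Minimal sketch: split EEG into five frequency bands and run per-band SVR.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

FS = 250  # assumed EEG sampling rate (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 45)}

def band_power(eeg, low, high, fs=FS):
    """Band-pass filter each channel and return its mean power.

    eeg: array of shape (n_trials, n_channels, n_samples).
    """
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=-1)
    return (filtered ** 2).mean(axis=-1)  # shape (n_trials, n_channels)

def per_band_regression(eeg, opinion_scores):
    """Fit one SVR per frequency band and report cross-validated MSE."""
    results = {}
    for name, (low, high) in BANDS.items():
        features = band_power(eeg, low, high)
        svr = SVR(kernel="rbf", C=1.0)
        mse = -cross_val_score(svr, features, opinion_scores,
                               scoring="neg_mean_squared_error", cv=5).mean()
        results[name] = mse  # lower error suggests a more informative band
    return results
```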
This thesis proposes the use of a convolutional neural network to address the feature engineering issue; with the current capabilities of neural-network-based methods, it is arguably possible to extract features from the input with minimal feature engineering. We also adopted the idea of combining audio features with brain activity features, but implemented it as late fusion in a neural-network-based architecture. Lastly, we tested the proposed model in both the subject-dependent and subject-independent cases to see whether it can generalize and make predictions on unseen data.
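The following PyTorch sketch illustrates the late-fusion idea: one convolutional branch encodes the EEG input, another encodes the audio input, and their embeddings are concatenated just before a regression head that predicts the opinion score. The branch depths, channel counts, input shapes, and class names are illustrative assumptions, not the exact architecture proposed in this thesis.

```python
# Minimal late-fusion sketch: separate convolutional encoders per modality,
# fused late and followed by a small regression head.
import torch
import torch.nn as nn

class Branch(nn.Module):
    """1-D convolutional encoder mapping a (channels, time) input to a vector."""
    def __init__(self, in_channels, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, x):
        return self.proj(self.conv(x).squeeze(-1))

class LateFusionMOS(nn.Module):
    """Late fusion: encode each modality separately, then regress the score."""
    def __init__(self, eeg_channels=64, audio_channels=40):
        super().__init__()
        self.eeg_branch = Branch(eeg_channels)
        self.audio_branch = Branch(audio_channels)
        self.head = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, eeg, audio):
        fused = torch.cat([self.eeg_branch(eeg), self.audio_branch(audio)], dim=-1)
        return self.head(fused).squeeze(-1)  # predicted opinion score per trial

# Usage with random tensors: a batch of 8 trials, 64 EEG channels x 500 samples,
# and 40 audio feature bins x 300 frames.
model = LateFusionMOS()
scores = model(torch.randn(8, 64, 500), torch.randn(8, 40, 300))
```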
The results show that the proposed neural-network-based architecture performs significantly better with less feature engineering in the subject-dependent scenario. The combined method also outperforms each baseline model in both the subject-dependent and subject-independent scenarios. Its performance in the subject-independent scenario further shows that the proposed method can handle entirely unseen data, meaning it is possible to use data from subjects who were not included in the training set.
Keywords: EEG, mean opinion score prediction, multimodal integration, convolutional neural network