Abstract:
|
The goal of this study is to predict acoustic features of expressive speech from semantic vector space representations. Although much successful work has gone into expressiveness analysis and prediction, the results often rely on manual labeling or on indirect evaluation of the prediction, such as through speech synthesis. The proposed analysis aims at direct acoustic feature prediction and comparison with the original acoustic features of an audiobook. The audiobook is mapped into a semantic vector space. A set of acoustic features is extracted from the same utterances, including i-vectors trained on an MFCC and F0 basis. Two regression models, a DNN and a baseline CART, are trained on the semantic coordinates. Semantic and acoustic context features are then combined for the prediction. The prediction is achieved successfully with the DNNs. A closer analysis shows that prediction works best for longer utterances or utterances with specific contexts, and worst for short general utterances and proper names.