Title:
|
Bilingual prosodic dataset compilation for spoken language translation
|
Author:
|
Öktem, Alp; Farrús, Mireia; Bonafonte, Antonio
|
Abstract:
|
Paper presented at: IberSpeech 2018, held from 21 to 23 November 2018 in Barcelona. |
Abstract:
|
This paper builds on a previous methodology that exploits
dubbed media material to build prosodically annotated
bilingual corpora. The almost fully automated
process produces data for training spoken language
models without the need to design and record
bilingual data. The methodology is put to use by
compiling an English-Spanish parallel corpus using a recent
TV series. The collected corpus contains 7000 parallel
utterances totaling about 10 hours of data annotated
with speaker information, word alignments and
word-level acoustic features. Both the extraction scripts
and the dataset are distributed as open source for research
purposes. |
Abstract:
|
The annotation work carried out by the annotators was financed with the 2018 Maria de Maeztu Reproducibility Award from the Department of Information and Communication Technologies of Universitat Pompeu Fabra, received by the first author. The second author is funded by the Spanish Ministry through
the Ramón y Cajal program. |
Subject(s):
|
Bilingual corpora; Spoken machine translation; Prosody |
Rights:
|
© 2018 ISCA
|
Document type:
|
Conference Object Article - Published version |
Published by:
|
International Speech Communication Association (ISCA)
|