Bilingual prosodic dataset compilation for spoken language translation

To access the full text documents, please follow this link: http://hdl.handle.net/10230/35600

Title:	Bilingual prosodic dataset compilation for spoken language translation
Author:	Öktem, Alp; Farrús, Mireia; Bonafonte, Antonio
Abstract:	Paper presented at: IberSpeech 2018, held from 21 to 23 November 2018 in Barcelona.
Abstract:	This paper builds on a previous methodology that exploits dubbed media material to build prosodically annotated bilingual corpora. The almost fully automated process produces data for training spoken language models without the need to design and record bilingual data. The methodology is put to use by compiling an English-Spanish parallel corpus from a recent TV series. The collected corpus contains 7000 parallel utterances totaling about 10 hours of data, annotated with speaker information, word alignments and word-level acoustic features. Both the extraction scripts and the dataset are distributed as open source for research purposes.
Abstract:	The annotation work carried out by the annotators was financed with the 2018 Maria de Maeztu Reproducibility Award from the Department of Information and Communication Technologies of Universitat Pompeu Fabra, received by the first author. The second author is funded by the Spanish Ministry through the Ramón y Cajal program.
Subject(s):	Bilingual corpora; Spoken machine translation; Prosody
Rights:	© 2018 ISCA
Document type:	Conference Object Article - Published version
Published by:	International Speech Communication Association (ISCA)