dc.contributor
Melero i Nogués, Maite
dc.contributor.author
Zhou, Chenyue
dc.date.issued
2022-09-21T16:55:53Z
dc.date.issued
2022-09-21
dc.identifier
http://hdl.handle.net/10230/54140
dc.description.abstract
Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Maite Melero
dc.description.abstract
The lack of parallel corpora is one of the biggest challenges hindering progress in
Machine Translation for low-resource languages. In this work, we crawl and filter
parallel sentences in Catalan and Chinese from Wikipedia in order to compile a
parallel corpus of good quality. This paper describes the processes we follow to build
the corpus, including mining the text data, computing sentence embeddings,
extracting sentence alignment and filtering for better corpus quality. We manually
audit the corpus quality based on an error taxonomy. The results show that the automatic
filtering we applied substantially improves the quality of our web-crawled
corpus. The corpus is then used as training data to fine-tune a multilingual Machine
Translation (MT) system in both the CA→ZH and ZH→CA directions. Fine-tuning with our
corpus improves the BLEU score in both directions on the public Flores-101 benchmark
test sets, which demonstrates the importance of parallel corpora in MT and the quality
of our Catalan-Chinese parallel corpus.
dc.format
application/pdf
dc.rights
CC Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Parallel corpus
dc.subject
Corpus quality
dc.subject
Machine translation
dc.subject
Low-resource languages
dc.title
Building a Catalan-Chinese parallel corpus from Wikipedia for use in machine translation
dc.type
info:eu-repo/semantics/masterThesis