Building a Catalan-Chinese parallel corpus from Wikipedia for use in machine translation

Zhou, Chenyue

dc.contributor

Melero i Nogués, Maite

dc.contributor.author

Zhou, Chenyue

dc.date.issued

2022-09-21T16:55:53Z

dc.date.issued

2022-09-21

dc.identifier

http://hdl.handle.net/10230/54140

dc.description.abstract

Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Maite Melero

dc.description.abstract

The lack of parallel corpora is one of the biggest challenges hindering progress in Machine Translation (MT) for low-resource languages. In this work, we crawl and filter parallel sentences in Catalan and Chinese from Wikipedia to compile a good-quality parallel corpus. This paper describes the process we follow to build the corpus: mining the text data, computing sentence embeddings, extracting sentence alignments, and filtering to improve corpus quality. We manually audit the corpus quality using an error taxonomy. The audit shows that the automatic filtering we applied substantially improves the quality of our web-crawled corpus. The corpus is then used as training data to fine-tune a multilingual MT system in both the CA→ZH and ZH→CA directions. Fine-tuning with our corpus improves the BLEU score in both directions on the Flores-101 public benchmark test sets, which demonstrates both the importance of parallel corpora in MT and the quality of our Catalan-Chinese parallel corpus.

dc.format

application/pdf

dc.language

eng

dc.rights

CC Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0)

dc.rights

https://creativecommons.org/licenses/by-nc-nd/4.0/

dc.rights

info:eu-repo/semantics/openAccess

dc.subject

Parallel corpus

dc.subject

Data mining

dc.subject

Corpus quality

dc.subject

Machine translation

dc.subject

Catalan

dc.subject

Chinese

dc.subject

Low-resource languages

dc.title

Building a Catalan-Chinese parallel corpus from Wikipedia for use in machine translation

dc.type

info:eu-repo/semantics/masterThesis

This item appears in the following Collection(s)

Treballs d'estudiants (Student works)