Status : Verified
| Personal Name | Visperas, Moses L. |
|---|---|
| Resource Title | On the Effects of Language Clustering for Low-Resource Multilingual Machine Translation Model for Select Philippine Languages |
| Date Issued | 24 January 2024 |
| Abstract | Language translation is a tedious task in written communication that requires mastery of both the source and the target language. Machine translation (MT) systems have been developed to reduce the amount of work this task requires. Commercial MT systems are available online but support only a few Philippine languages. Meanwhile, the MT systems in recent studies of other Philippine languages are either outdated or perform poorly when translating to or from Tagalog. Neural machine translation (NMT) has gained wide popularity over the years, achieving significant quality across multiple languages. However, NMT still underperforms when translating low-resource language pairs, which include most Philippine languages. In this study, we investigated the effectiveness of language clustering in multilingual machine translation for low-resource Philippine languages. We analyzed two clusters: the Northern Philippine cluster (Ibanag, Ilocano, Pangasinan) and the Central Philippine cluster (Bicolano, Cebuano, Hiligaynon, Waray), comparing them to a baseline multilingual model consisting of all the aforementioned languages. Our models, trained on the JW300 dataset using the T5-small architecture, were evaluated using n-gram (BLEU, NIST, METEOR) and distance-based (TER) metrics. We statistically validated our models with paired bootstrap resampling, in which we sampled and evaluated 300 sentence pairs for each translation direction from an original test set of 1,000 sentence pairs over 1,000 iterations. All models produced good translations, with average BLEU scores of 30 and above. Initial analysis revealed that language clustering showed potential benefits, improving confidence intervals in translations from Tagalog to other languages. Specifically, it resulted in an average gain of 0.7813 and 0.8754 BLEU points for the lower and upper bounds of the confidence intervals, respectively. However, detailed statistical analysis using paired bootstrap resampling found |
| Degree Course | MS Electrical Engineering |
| Language | English |
| Keyword | machine translation; deep learning; multilingual; language clustering |
| Material Type | Thesis/Dissertation |
Preliminary Pages
586.10 Kb
Category : P - Author wishes to publish the work personally.
Access Permission : Limited Access
