Graduation Year
2021
Document Type
Thesis
Degree
M.S.C.S.
Degree Name
MS in Computer Science (M.S.C.S.)
Degree Granting Department
Computer Science and Engineering
Major Professor
Sriram Chellappan, Ph.D.
Co-Major Professor
John Licato, Ph.D.
Committee Member
Marvin Andujar, Ph.D.
Keywords
multilingual, neural networks, NLP, participatory research
Abstract
Machine Translation (MT) has the potential to bridge the gap between the developed world and marginalized communities by making information more accessible in real time. While there are over 7,000 spoken languages in the world, only about a hundred have access to high-quality MT systems, and even fewer enjoy the benefits of more advanced language technologies. Unfortunately, resource scarcity and the lack of digital infrastructure are only some of the many challenges associated with globalizing NLP. Many large-scale multilingual studies and datasets get little to no feedback from native speakers or linguistic experts of the languages involved, leading to serious problems of data quality and potential biases. In this thesis, we present a case study of participatory research in 22 Turkic languages involving native speakers, language technologists, researchers, linguists, commercial entities, and more. Through this thesis, we compile and release the largest public corpus for MT in Turkic languages along with 26 bilingual baseline models. We outline the curation and release of public datasets, the development of machine translation technologies, and their deployment in real-world scenarios. In addition, we discuss the lessons learned through this case study, its applications and limitations, and its implications for future projects.
Scholar Commons Citation
Mirzakhalov, Jamshidbek, "Turkic Interlingua: A Case Study of Machine Translation in Low-resource Languages" (2021). USF Tampa Graduate Theses and Dissertations.
https://digitalcommons.usf.edu/etd/8829