ijaers social
facebook
twitter
Blogger
google plus

International Journal of Advanced Engineering, Management and Science


Data Augmentation Strategies for Natural Language Processing in Low-Resource and Indigenous Languages

( Vol-11,Issue-2,March - April 2025 )

Author(s): Haradhan Mardi


Download Full Text PDF
Download with Cover Page Total View : 390
Downloads : 5
Page No: 220-226
ijaems crossref doiDOI: 10.22161/ijaems.112.21

Keywords:

Data Augmentation, Natural Language Processing, Low-Resource Languages, Indigenous Languages, Transfer Learning

Abstract:

Natural Language Processing (NLP), a field enabling computers to comprehend and interact with human language, faces significant challenges with low-resource languages—those with limited digital text availability—and indigenous languages deeply rooted in native cultural traditions. This review examines data augmentation techniques, which generate additional training data from existing limited resources, to address these challenges. The goal is to provide a comprehensive overview of how such methods can promote greater linguistic inclusion in artificial intelligence, particularly for languages from regions like India, Africa, and Latin America. The significance arises from the global diversity of over 7,000 languages, where only a small fraction receives substantial AI focus, marginalizing many others. Low-resource languages suffer from minimal datasets, resulting in suboptimal performance on tasks such as machine translation. Data augmentation mitigates this by producing synthetic yet effective data, such as through sentence paraphrasing, thereby improving model accuracy without requiring extensive new data collection. This review discusses foundational concepts of augmentation in NLP, various methodological categories, global case studies, emphasis on Indian languages including Santali and Bodo, integration with transfer learning using models like mBERT and XLM-R, ethical considerations, and prospective directions. Key findings indicate improvements of 10-20% in accuracy or BLEU scores for languages such as Guarani and Manipuri. Applications span education, healthcare, and cultural preservation, advocating for equitable AI development. Sourced from more than 35 publications spanning 2015-2024, the review incorporates perspectives from India, Africa, and other regions, drawing from repositories like the ACL Anthology and IEEE. It underscores the need for ethical practices to minimize bias and encourages collaborative efforts. Ultimately, data augmentation fosters a more enduring and inclusive NLP ecosystem, empowering indigenous voices in the digital realm.

Article Info:

Received: 25 Feb 2025; Received in revised form: 21 Mar 2025; Accepted: 25 Mar 2025; Available online: 31 Mar 2025

Cite This Article:
Citations:
APA | ACM | Chicago | Harvard | IEEE | MLA | Vancouver | Bibtex
Share: