Data Augmentation Strategies for Natural Language Processing in Low-Resource and Indigenous Languages

Haradhan Mardi

Data Augmentation Strategies for Natural Language Processing in Low-Resource and Indigenous Languages ( Vol-11,Issue-2,March - April 2025 )
Author(s): Haradhan Mardi	Download Full Text PDF Download with Cover Page Total View : 622 Downloads : 6 Page No: 220-226 DOI: 10.22161/ijaems.112.21
Keywords:
Data Augmentation, Natural Language Processing, Low-Resource Languages, Indigenous Languages, Transfer Learning
Abstract:
Natural Language Processing (NLP), a field enabling computers to comprehend and interact with human language, faces significant challenges with low-resource languages—those with limited digital text availability—and indigenous languages deeply rooted in native cultural traditions. This review examines data augmentation techniques, which generate additional training data from existing limited resources, to address these challenges. The goal is to provide a comprehensive overview of how such methods can promote greater linguistic inclusion in artificial intelligence, particularly for languages from regions like India, Africa, and Latin America. The significance arises from the global diversity of over 7,000 languages, where only a small fraction receives substantial AI focus, marginalizing many others. Low-resource languages suffer from minimal datasets, resulting in suboptimal performance on tasks such as machine translation. Data augmentation mitigates this by producing synthetic yet effective data, such as through sentence paraphrasing, thereby improving model accuracy without requiring extensive new data collection. This review discusses foundational concepts of augmentation in NLP, various methodological categories, global case studies, emphasis on Indian languages including Santali and Bodo, integration with transfer learning using models like mBERT and XLM-R, ethical considerations, and prospective directions. Key findings indicate improvements of 10-20% in accuracy or BLEU scores for languages such as Guarani and Manipuri. Applications span education, healthcare, and cultural preservation, advocating for equitable AI development. Sourced from more than 35 publications spanning 2015-2024, the review incorporates perspectives from India, Africa, and other regions, drawing from repositories like the ACL Anthology and IEEE. It underscores the need for ethical practices to minimize bias and encourages collaborative efforts. Ultimately, data augmentation fosters a more enduring and inclusive NLP ecosystem, empowering indigenous voices in the digital realm.
Article Info:
Received: 25 Feb 2025; Received in revised form: 21 Mar 2025; Accepted: 25 Mar 2025; Available online: 31 Mar 2025
Cite This Article:
Citations: APA \| ACM \| Chicago \| Harvard \| IEEE \| MLA \| Vancouver \| Bibtex
Share:
Share

International Journal of Advanced Engineering, Management and Science

Google Scholar Analysis

For Authors

ijaems Issue

Facebook

Data Augmentation Strategies for Natural Language Processing in Low-Resource and Indigenous Languages

( Vol-11,Issue-2,March - April 2025 )

Author(s): Haradhan Mardi

Keywords:

Abstract:

Cite This Article:

Share:

Other related Journals

For Authors

Licence