Abstract:
The study presents a framework for developing a Question-Answering System
(QA System) for the Kazakh language and highlights its relevance to low-resource language
(LRL) text processing. This effort aims to narrow the resource gap for languages that lack substantial digital
tools. Specifically, the project focuses on geographical questions about Kazakhstan, aiming to enhance
accessibility and understanding of the nation's geography. The challenges associated with LRL text
processing are addressed in three phases: creating a question-answer corpus, training a Bidirectional Encoder
Representations from Transformers (BERT)-based model, and evaluating the system with the Bilingual
Evaluation Understudy (BLEU) metric. The first phase is the compilation of a corpus
containing 50,000 questions, which supports the subsequent development phases of
the QA System. In the second phase, a BERT model with 91,821,056 parameters is trained
to capture the linguistic nuances of the Kazakh language. The final
phase evaluates the system with the BLEU metric, on which it achieves an average
score of 0.9576. This score indicates high n-gram overlap between the system-generated answers and
the reference answers, suggesting that the system is effective at interpreting and responding to queries
about the geography of Kazakhstan. The study contributes a systematic approach to QA System
development for a low-resource language and supports the model's effectiveness through
evaluation and comparative analysis.
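
For readers unfamiliar with the metric, the following is a minimal sketch of how an average BLEU score over answer pairs could be computed. It is an illustration only: the abstract does not state the BLEU variant, tokenization, or smoothing used, so whitespace tokenization, uniform weights up to 4-grams, and no smoothing are assumptions here. The function names and the example Kazakh sentences are hypothetical and are not taken from the study's corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4):
    """BLEU for one candidate against one reference, with uniform weights.

    Uses clipped n-gram precision and the brevity penalty. N-gram orders
    longer than the candidate are skipped so that very short answers can
    still score (a simplification in this sketch, not standard BLEU).
    """
    ref, cand = reference.split(), candidate.split()  # assumed tokenization
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:  # candidate has fewer than n tokens
            break
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if matched == 0:  # no smoothing: any zero precision gives BLEU 0
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / len(log_precisions))


def average_bleu(pairs):
    """Mean sentence-level BLEU over (reference, candidate) pairs."""
    return sum(sentence_bleu(r, c) for r, c in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical reference/system answer pairs, for illustration only.
    pairs = [
        ("Қазақстанның астанасы Астана қаласы",
         "Қазақстанның астанасы Астана қаласы"),
        ("Қазақстанның астанасы Астана қаласы",
         "Қазақстанның астанасы Алматы қаласы"),
    ]
    print(average_bleu(pairs))
```

Because this sketch applies no smoothing, a single wrong word in a short answer can drive the sentence score to zero; production toolkits typically offer smoothing options, and the choice affects the reported average.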