

ArabicQuest: Enhancing Arabic Visual Question Answering with LLM Fine-Tuning
In an attempt to bridge the semantic gap between language understanding and visual content, Visual Question Answering (VQA) offers a challenging intersection of computer vision and natural language processing. Large Language Models (LLMs) have shown remarkable ability in natural language understanding; however, their use in VQA, particularly for Arabic, is still largely unexplored. This study aims to bridge this gap by examining how well LLMs can improve VQA models. We apply state-of-the-art AI algorithms to datasets from multiple domains, including electric devices, Visual Genome, RSVQA, and ChartsQA. We introduce ArabicQuest, a Text Question Answering (TQA) tool that combines Arabic queries with visual data. We assess the performance of LLMs across various question types and image settings and find that fine-tuning models such as LLaMA-2, BLIP-2, and Idefics-9B-Instruct yields encouraging results, although counting and comparison tasks remain challenging. Our findings underscore the importance of advancing VQA, especially for Arabic, to improve accessibility and user satisfaction across a variety of applications.
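As a rough sketch of the kind of vision-language inference pipeline that the models named above support, the example below queries a pretrained BLIP-2 checkpoint with an Arabic question about an image using the Hugging Face transformers library. The checkpoint name, image path, question, and prompt format are illustrative assumptions and are not details taken from the paper, which fine-tunes its own variants on Arabic data.

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative public checkpoint; the paper fine-tunes its own Arabic VQA variants.
MODEL_ID = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(MODEL_ID)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

# Hypothetical input image and Arabic question ("How many devices appear in the image?").
image = Image.open("example_chart.png").convert("RGB")
question = "كم عدد الأجهزة التي تظهر في الصورة؟"

# BLIP-2 follows a "Question: ... Answer:" prompting convention for VQA.
prompt = f"Question: {question} Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)

The same pattern extends to Idefics-9B-Instruct or a LoRA-fine-tuned LLaMA-2 backbone by swapping the processor and model classes for the corresponding ones in transformers.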