

ArabicQuest: Enhancing Arabic Visual Question Answering with LLM Fine-Tuning
In an attempt to bridge the semantic gap between language understanding and visual content, Visual Question Answering (VQA) offers a challenging intersection of computer vision and natural language processing. Large Language Models (LLMs) have shown remarkable ability in natural language understanding; however, their use in VQA, particularly for Arabic, is still largely unexplored. This study aims to bridge this gap by examining how well LLMs can improve VQA models. We apply state-of-the-art AI algorithms to datasets from multiple domains, including electric devices, Visual Genome, RSVQA, and ChartsQA. We introduce ArabicQuest, a Text Question Answering (TQA) tool that combines Arabic queries with visual data. We assess the performance of LLMs across various question types and image settings and find that fine-tuning models such as LLaMA-2, BLIP-2, and Idefics-9B-Instruct yields encouraging results, although counting and comparison tasks remain challenging. Our findings underscore the importance of advancing VQA, especially for Arabic, to improve accessibility and user satisfaction across a variety of applications.
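As a rough sketch of the kind of vision-language inference pipeline that the models named above support, the example below queries a pretrained BLIP-2 checkpoint with an Arabic question about an image using the Hugging Face transformers library. The checkpoint name, image path, question, and prompt format are illustrative assumptions and are not details taken from the paper, which fine-tunes its own variants on Arabic data.

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative public checkpoint; the paper fine-tunes its own Arabic VQA variants.
MODEL_ID = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(MODEL_ID)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

# Hypothetical input image and Arabic question ("How many devices appear in the image?").
image = Image.open("example_chart.png").convert("RGB")
question = "كم عدد الأجهزة التي تظهر في الصورة؟"

# BLIP-2 follows a "Question: ... Answer:" prompting convention for VQA.
prompt = f"Question: {question} Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)

The same pattern extends to Idefics-9B-Instruct or a LoRA-fine-tuned LLaMA-2 backbone by swapping the processor and model classes for the corresponding ones in transformers.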