
Towards Arabic Image Captioning: A Transformer-Based Approach
The automatic generation of textual descriptions of images, known as image captioning, is important in many applications, including accessibility for the visually impaired, social media enhancement, automatic image description for search engines, and assistive technology for education. While extensive research has been conducted in English, work on Arabic remains limited because of the language's complexity. Arabic is one of the world's most widely spoken languages, with around 420 million native speakers, and is the official language of 22 nations in the Middle East and North Africa. Image captioning in Arabic will foster inclusion, improve communication, and drive technological progress. This study therefore develops an Arabic image caption generator using the Flickr30k dataset. The proposed model comprises an encoder, a decoder, and a translation transformer that together generate descriptive Arabic captions for input images. The encoder uses ResNet101, a deep convolutional neural network, to extract rich visual features from the input images. The decoder consists of an attention block, which lets the model focus on different parts of the image, and an LSTM that generates the captions. Finally, the generated captions are translated using an English-to-Arabic dialect translation transformer. The model achieved a BLEU-4 score of 0.1253 for the English-generated captions and medium user satisfaction for the Arabic captions.
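The abstract does not include a reference implementation, so the following is a minimal illustrative PyTorch sketch of the described pipeline: a ResNet101 encoder that yields region features, an additive attention block, and an LSTM decoder that attends over those regions at each step. All hyperparameters (embed_dim, hidden_dim, attn_dim) and module names are assumptions for illustration, not the authors' settings.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """ResNet101 backbone that returns a grid of image-region features."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        # Drop the average pool and classifier; keep the spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                           # (B, 3, H, W)
        feats = self.backbone(images)                    # (B, 2048, H/32, W/32)
        b, c, h, w = feats.shape
        return feats.view(b, c, h * w).permute(0, 2, 1)  # (B, regions, 2048)

class Attention(nn.Module):
    """Additive attention: scores each image region against the LSTM state."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, regions, feat_dim); hidden: (B, hidden_dim)
        scores = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)             # region weights
        return (alpha * feats).sum(dim=1)                # (B, feat_dim)

class Decoder(nn.Module):
    """LSTM decoder that consumes one token and one attended context per step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 feat_dim=2048, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = Attention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, feats, token, h, c):
        # h and c start as zeros (or a projection of the mean image feature).
        context = self.attend(feats, h)
        h, c = self.lstm(torch.cat([self.embed(token), context], dim=1), (h, c))
        return self.out(h), h, c                         # logits over the vocabulary
```

At inference time, `step` would be invoked token by token, greedily or with beam search, feeding the predicted word back in until an end-of-sequence symbol is emitted.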
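The specific translation transformer is not identified in the abstract. As a hypothetical stand-in, the publicly available OPUS-MT English-to-Arabic checkpoint (Helsinki-NLP/opus-mt-en-ar) illustrates how the generated English captions could be translated in the final stage:

```python
from transformers import MarianMTModel, MarianTokenizer

# Illustrative public checkpoint; the paper's actual translation model may differ.
model_name = "Helsinki-NLP/opus-mt-en-ar"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

caption = "a man in a red shirt is riding a bicycle down the street"
batch = tokenizer([caption], return_tensors="pt", padding=True)
generated = model.generate(**batch)
arabic = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(arabic)
```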
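BLEU-4, the reported metric, averages 1- to 4-gram precision of each generated caption against its reference captions, with a brevity penalty. A minimal sketch of computing it with NLTK follows; the toy captions are illustrative, not from the dataset.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against all reference captions for its image.
references = [[
    "a man rides a bicycle down a city street".split(),
    "a cyclist in a red shirt travels along the road".split(),
]]
hypotheses = ["a man in a red shirt rides a bicycle".split()]

# Uniform weights over 1- to 4-gram precisions; smoothing keeps short
# captions with a missing n-gram order from collapsing to zero.
score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.4f}")
```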