Vision and language understanding has emerged as a subject of intense study in Artificial Intelligence. Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and the interaction between the question and the visual information for textual semantics extraction is frequently ignored. Targeting these issues, this paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model comprises two modules, namely Question-specific Transformer (QT) and Visual Inference (VI). Given the temporal pyramid constructed from a video, QT builds the question semantics from the coarse-to-fine multimodal co-occurrence between each word and the visual content. Under the guidance of such question-specific semantics, VI infers the visual clues from the local-to-global multi-level interactions between the question and the video. Within each module, a multimodal attention mechanism is introduced to aid the extraction of question-video interactions, with residual connections adopted for information passing across the different levels. Extensive experiments on three VideoQA datasets demonstrate better performance of the proposed method in comparison with state-of-the-art approaches.

The Arabic language is a morphologically rich language with relatively few resources and a less explored syntax compared to English. Given these limitations, Arabic Natural Language Processing (NLP) tasks such as Sentiment Analysis (SA), Named Entity Recognition (NER), and Question Answering (QA) have proven very challenging to tackle. Recently, with the surge of transformer-based models, language-specific BERT models have proven to be very efficient at language understanding, provided they are pre-trained on a very large corpus. Such models were able to set new standards and achieve state-of-the-art results for most NLP tasks. In this paper, the authors pre-train BERT specifically for the Arabic language, in pursuit of the same success that BERT achieved for English. The performance of AraBERT is compared to multilingual BERT from Google and other state-of-the-art approaches, and the results show that the newly developed AraBERT achieves state-of-the-art performance on most tested Arabic NLP tasks. The pretrained AraBERT models are publicly available on this https URL, hoping to encourage research and applications for Arabic NLP.
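As a quick illustration of how the released models can be used, below is a minimal sketch of loading a pretrained Arabic BERT checkpoint through the Hugging Face `transformers` library and extracting sentence features for a downstream task such as SA, NER, or QA. The checkpoint identifier `aubmindlab/bert-base-arabert` is an assumption; use whichever name accompanies the official AraBERT release, which may also recommend Arabic-specific preprocessing not shown here.

```python
# Minimal sketch: load a pretrained Arabic BERT checkpoint and encode a
# sentence. The checkpoint name below is an assumed identifier, not an
# official reference; substitute the one from the AraBERT release.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "aubmindlab/bert-base-arabert"  # assumed checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Encode an Arabic sentence and obtain contextual embeddings.
inputs = tokenizer("مرحبا بالعالم", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] vector can feed a task-specific head for SA, NER, or QA.
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768]) for a base-size model
```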
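Returning to the TPT description above, the following sketch shows one way a multimodal attention block with residual connections could look: question tokens attend over video features at a single temporal scale, and the same block is reused across a coarse-to-fine pyramid of pooled clip features. This is an illustrative approximation under stated assumptions, not the authors' implementation; the class name, feature dimensions, and pyramid construction via average pooling are all assumptions.

```python
# Illustrative sketch of a question-video attention block with a residual
# connection, applied over a coarse-to-fine temporal pyramid. Names and
# dimensions are assumptions, not the TPT reference implementation.
import torch
import torch.nn as nn

class QuestionVideoAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # question: (batch, num_words, dim); video: (batch, num_clips, dim)
        attended, _ = self.attn(query=question, key=video, value=video)
        # Residual connection carries information across pyramid levels.
        return self.norm(question + attended)

# Approximate a temporal pyramid by pooling clip features to shorter lengths
# and refining the question representation level by level (coarse to fine).
question = torch.randn(2, 12, 512)   # 12 question-word features
video = torch.randn(2, 32, 512)      # 32 clip features
block = QuestionVideoAttention()
levels = [
    nn.functional.adaptive_avg_pool1d(video.transpose(1, 2), length).transpose(1, 2)
    for length in (4, 8, 16, 32)     # coarse-to-fine temporal scales
]
out = question
for level in levels:
    out = block(out, level)          # refine question semantics per level
print(out.shape)                     # torch.Size([2, 12, 512])
```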