CODNet: Context-based object detection network for multimodal image captioning and virtual question answering
Pau, Giovanni
2025-01-01
Abstract
Present Multimodal Large Language (MLL) models perform exceptionally well in vision-language tasks such as image captioning and virtual question answering, but they fall short in essential perceptual tasks such as object detection. To overcome this constraint, a novel research work on contextual object detection is presented, which aims to comprehend visible objects in various human-AI interaction scenarios. This study investigates two widely used scenarios: image captioning and virtual question answering. For human-machine interaction, a new CODNet (Context-based Object Detection Network) is presented that can locate, recognize, and associate visual objects with language inputs. The proposed network performs end-to-end modeling of visual-language contexts. It unifies vision and language by interpreting images as a foreign language and aligning language capabilities with vision-centric tasks that can be managed and composed flexibly through language instructions. The network comprises three sub-modules: an encoder that extracts the required features from the context, a pre-trained Large Language Model Meta AI (LLaMA) for decoding the context, and a decoder that predicts anchor boxes around the contextually recognized objects in the image. Features are extracted with a convolutional neural network merged with a multiplicative LSTM, and a fine-tuned YOLOv6 model detects and locates the required objects in the given image. The network is trained on the MS-COCO dataset, evaluated on real-time images, and achieves over 95% average weighted accuracy.
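Since the abstract outlines a three-sub-module pipeline (a context encoder built from a CNN merged with a multiplicative LSTM, a pre-trained LLaMA decoder for the linguistic context, and a box decoder over the contextually recognized objects), a minimal PyTorch sketch of that data flow is given below. All module names and dimensions are assumptions for illustration; a plain nn.LSTM stands in for the multiplicative LSTM, and simple linear layers stand in for the LLaMA context decoder and the YOLOv6 detection head. This is not the authors' implementation.

# Hypothetical sketch of the CODNet data flow described in the abstract:
# encoder (CNN + LSTM over the text context) -> context decoder -> box decoder.
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    """Extracts and fuses visual features (CNN) with context features (LSTM)."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.cnn = nn.Sequential(                 # stand-in visual backbone
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        # nn.LSTM used here as a stand-in for the multiplicative LSTM in the paper.
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        vis = self.cnn(image)                                  # (B, d_model)
        _, (txt, _) = self.lstm(self.embed(tokens))            # (1, B, d_model)
        return torch.cat([vis, txt.squeeze(0)], dim=-1)        # fused context


class BoxDecoder(nn.Module):
    """Predicts boxes and class logits for the contextually referenced objects."""

    def __init__(self, d_in: int, num_queries: int = 10, num_classes: int = 80):
        super().__init__()
        self.num_queries = num_queries
        self.num_classes = num_classes
        self.head = nn.Linear(d_in, num_queries * (4 + num_classes))

    def forward(self, fused: torch.Tensor):
        out = self.head(fused).view(-1, self.num_queries, 4 + self.num_classes)
        boxes = out[..., :4].sigmoid()            # normalized (cx, cy, w, h)
        logits = out[..., 4:]                     # per-query class scores
        return boxes, logits


class CODNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = ContextEncoder()
        # A pre-trained LLaMA model would decode the linguistic context here;
        # a linear layer stands in so the sketch stays self-contained.
        self.context_decoder = nn.Linear(512 * 2, 512 * 2)
        self.box_decoder = BoxDecoder(d_in=512 * 2)

    def forward(self, image, tokens):
        fused = self.encoder(image, tokens)
        ctx = self.context_decoder(fused)
        return self.box_decoder(ctx)


if __name__ == "__main__":
    model = CODNetSketch()
    boxes, logits = model(torch.randn(2, 3, 224, 224),
                          torch.randint(0, 32000, (2, 16)))
    print(boxes.shape, logits.shape)   # (2, 10, 4) and (2, 10, 80)

In this sketch the fused image-plus-context representation drives the box predictions, mirroring the abstract's claim that detection is conditioned on the language context rather than performed in isolation.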