Analysis of Artificial Intelligence Algorithms for the Recognition of Images, Texts, and Audio Signals on Mobile Devices

Aida Mustafayeva; Elmira Israfilova; Gunel Baxshiyeva; Saadat Aslanova

doi:10.30546/UNECCSDT.2026.001.214

Authors

Aida Mustafayeva, Mingachevir State University, https://orcid.org/0000-0003-0801-5605
Elmira Israfilova, Mingachevir State University, https://orcid.org/0000-0002-9476-5279
Gunel Baxshiyeva, Mingachevir State University, https://orcid.org/0000-0002-2122-7859
Saadat Aslanova, Mingachevir State University, https://orcid.org/0000-0002-5280-6941

DOI:

https://doi.org/10.30546/UNECCSDT.2026.001.214

Keywords:

Artificial Intelligence, Neural Network Architectures, Convolutional Neural Networks (CNN), Mobile and Edge AI, Intelligent Cyber-Physical Platforms, Real-Time Recognition, Multimodal AI

Abstract

This article investigates the development and application of a hybrid multimodal neural model that processes visual imagery and textual data synchronously. The primary objective of the study is to design a computationally efficient, explainable, and adaptive decision-making system for real-time object detection and recognition by integrating convolutional neural networks (CNNs) with transformer-based natural language processing models. An analysis of existing studies indicates that approaches relying exclusively on either visual or textual models fail to provide a comprehensive semantic interpretation of events. To overcome this limitation, the study adopts a multimodal framework that analyzes images jointly with their descriptive text, so that object detection and visual interpretation can be performed with higher accuracy and operational efficiency than with a single modality.

The detection of traffic rule violations in road transportation systems is selected as the primary object of research. The applicability of the proposed methodology to other domains is also examined, including automatic object identification, detection of employee tardiness, monitoring of safety violations in industrial environments, and other supervision and control scenarios. The proposed hybrid multimodal model comprises three principal stages: feature extraction from visual data using convolutional neural networks, semantic analysis of textual data using transformer-based natural language processing models, and multimodal integration of the extracted visual features with the semantic representations. The model was implemented in Python using the PyTorch framework, and its real-time performance was evaluated under laboratory conditions.
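The three stages described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the class name `HybridMultimodalSketch`, the layer sizes, the vocabulary size, the two-class output, and the choice of concatenation as the fusion step are all assumptions made for illustration, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn


class HybridMultimodalSketch(nn.Module):
    """Hypothetical three-stage model: CNN image features, transformer text features, fusion."""

    def __init__(self, vocab_size=1000, embed_dim=64, num_classes=2):
        super().__init__()
        # Stage 1: a small CNN that maps an RGB image to a 32-dimensional feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch, 32, 1, 1)
            nn.Flatten(),             # -> (batch, 32)
        )
        # Stage 2: token embedding followed by one transformer encoder layer.
        # Positional encoding is left out here to keep the sketch short.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=1)
        # Stage 3: fusion by concatenation (an assumed choice), then a linear classifier.
        self.classifier = nn.Linear(32 + embed_dim, num_classes)

    def forward(self, images, token_ids):
        image_features = self.cnn(images)                       # (batch, 32)
        text_states = self.text_encoder(self.embed(token_ids))  # (batch, seq, embed_dim)
        text_features = text_states.mean(dim=1)                 # mean-pool over tokens
        fused = torch.cat([image_features, text_features], dim=1)
        return self.classifier(fused)                           # (batch, num_classes)


# Usage with random stand-in data: two 64x64 images, each paired with 12 token ids.
model = HybridMultimodalSketch().eval()
images = torch.randn(2, 3, 64, 64)
token_ids = torch.randint(0, 1000, (2, 12))
with torch.no_grad():
    logits = model(images, token_ids)
```

Concatenation is the simplest way to combine the two feature vectors. Other fusion strategies, such as cross-attention between image and text features, fit the same three-stage outline.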

Experimental results indicate that the proposed approach satisfies ethical, legal, and technical requirements and has broad application potential across diverse domains, ranging from urban surveillance systems to agriculture, industrial facilities, and public monitoring mechanisms. Comparative analyses confirm both the theoretical novelty of the model and its practical advantages as an adaptive, computationally efficient, and explainable solution. These characteristics allow the model to serve as a reliable decision-making system in a variety of application scenarios and offer promising prospects for its further development.
