
CANKO SOCIETY FOR AI AND SOCIAL VALUE

A research network connecting artificial intelligence and social value

Development of a Deep Learning-Based Model for Detecting Clickbait and Misleading Titles on YouTube: A Multimodal Dataset-Centric Approach

Keywords: clickbait detection, YouTube content moderation, misleading video titles, multimodal deep learning, BERT, CLIP, vision transformer, fake news detection, content consistency analysis, AI ethics, digital media integrity

Submission Type: Abstract

Status: In Review | Submitted at: 2025-06-05 14:37:56

Abstract

The proliferation of user-generated video content on platforms like YouTube has increased the prevalence of clickbait and misleading video titles, particularly those that exaggerate or misrepresent the actual content to attract views. This phenomenon not only degrades the user experience but also raises ethical concerns, especially for younger audiences who may be exposed to deceptive, sensational, or emotionally manipulative content. This study proposes a deep learning-based framework to detect and classify such titles by analyzing the consistency between a video's title and its actual content.
The research begins with the construction of a novel multimodal dataset of YouTube video metadata, including titles, descriptions, automatically generated captions, and thumbnails. A custom annotation protocol labels each sample as "clickbait," "exaggerated," or "factual," based on the semantic and thematic alignment between the title and the video content.
The proposed detection model combines natural language processing and computer vision techniques. For textual analysis, transformer-based language models such as BERT capture contextual cues and semantic intent in the title and captions. Thumbnails are processed with vision transformer models such as ViT or multimodal models such as CLIP to assess visual-title consistency. In addition, a cross-modal attention mechanism is introduced to strengthen the model's ability to capture correlations between the title and the video content.
The evaluation phase includes experiments using precision, recall, and F1-score metrics, demonstrating the effectiveness of the proposed method in distinguishing misleading titles from factual ones. The dataset and model are further tested against real-world YouTube content to validate generalizability.
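As an illustration of the evaluation metrics named above, the following is a minimal sketch of per-class precision, recall, and F1 for the three annotation labels, plus a macro-averaged F1. The abstract does not say how scores are averaged across classes, so macro averaging is an assumption here, and the function names and toy labels are hypothetical rather than taken from the study.

```python
from collections import Counter

# The three labels come from the abstract; everything else is illustrative.
LABELS = ("clickbait", "exaggerated", "factual")


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values (an assumed averaging choice)."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Made-up toy annotations and predictions, not data from the study.
y_true = ["clickbait", "clickbait", "exaggerated", "factual", "factual", "factual"]
y_pred = ["clickbait", "exaggerated", "exaggerated", "factual", "factual", "clickbait"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label:12s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1 = {macro_f1(scores):.3f}")
```

Reporting the scores per class rather than as a single accuracy figure shows whether a model confuses "exaggerated" with "clickbait," which a three-way labeling scheme makes likely.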
This research contributes to the field by introducing a structured methodology for detecting deceptive content on video platforms and provides a basis for future studies in digital media integrity, content moderation, and AI-driven policy enforcement. It also has potential applications in platform-level moderation systems, parental control tools, and media literacy education.
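The abstract does not specify how its cross-modal attention mechanism is built. As a rough sketch of the general idea only, the example below applies plain scaled dot-product attention in which title-token embeddings act as queries over thumbnail-patch embeddings. It assumes a single head with no learned projection matrices, and the 2-dimensional toy vectors and all names are hypothetical; a real model would use embeddings from encoders such as BERT and ViT or CLIP.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values.

    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to key j.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        weights.append(w)
        dim = len(values[0])
        outputs.append([sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)])
    return outputs, weights


# Toy embeddings standing in for encoder outputs (assumed, not from the study).
title_tokens = [[1.0, 0.0], [0.0, 1.0]]               # text side: queries
thumb_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # image side: keys and values

attended, weights = cross_attention(title_tokens, thumb_patches, thumb_patches)
for i, row in enumerate(weights):
    print(f"title token {i} attends to patches with weights {[round(w, 3) for w in row]}")
```

Each output row is a thumbnail summary weighted by its relevance to one title token; a downstream classifier could compare these summaries with the text representation to judge title-content consistency.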

Authors

  • AI (First Author), Machine – ai.social.value@gmail.com
