Deep Learning-Driven Malicious URL Detection: A comprehensive analysis using Convolutional Neural Networks and feature engineering on the PhiUSIIL dataset

Authors

  • Sunbal Faraz Hayat Pakistan Navy, Islamabad, Pakistan
  • Mazhar Iqbal Sharif Department of CS&EE, Sharif College of Engineering and Technology, Lahore, Pakistan
  • Hafiz Muneeb Ahmad IITECH College of Computer Sciences, IITECH Gujranwala, Pakistan
  • Ali Raza Lateef International Collaborative Research Group, Lahore, Pakistan
  • Abdul Wahab Waseem International Collaborative Research Group, Lahore, Pakistan
  • Imran Ahmad International Collaborative Research Group, Lahore, Pakistan

DOI:

https://doi.org/10.54692/ijeci.2026.1001/266

Keywords:

Deep Learning, Convolutional Neural Networks, Malicious URL Detection, Phishing Detection, Feature Engineering, Cybersecurity, Machine Learning, WEKA, Neural Network Architecture, PhiUSIIL Dataset, ROC-AUC Analysis, Real-time Threat Detection

Abstract

Malicious URLs are a significant cybersecurity threat: they enable phishing, malware downloads, and data breaches that jeopardize the security of millions of users worldwide. Conventional detection methods, such as blacklist-based systems and rule-based heuristics, perform poorly against zero-day threats and adversarially generated URLs. This study presents an in-depth analysis of deep learning models for malicious URL detection, using Convolutional Neural Networks (CNNs) with advanced feature engineering. A rigorous experimental approach was followed using WEKA and the PhiUSIIL Phishing URL Dataset (available at the UCI Machine Learning Repository), which consists of 235,795 instances (134,850 legitimate and 100,945 phishing URLs) with 48 features. The proposed CNN architecture, with dropout regularization and batch normalization, achieves 99.12% accuracy, 98.95% precision, 99.28% recall, and a 99.11% F1-score, a significant improvement over baseline machine learning algorithms such as Random Forest (97.84% accuracy), Support Vector Machines (96.7 [...] The study follows PRISMA standards for its systematic literature review and applies rigorous evaluation criteria, including confusion matrices, ROC-AUC curves, computational efficiency measures, and feature ranking using gradient-weighted class activation mapping. The findings show that the CNN architecture effectively learns complex non-linear patterns in URL structure, and that lexical attributes (URL length, distribution of special characters) and host-based attributes (domain age, WHOIS information) have the greatest discriminative power. The results provide a solid framework for real-time malicious URL detection, validated through strict statistical analysis and cross-validation procedures.
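
As a reading aid, the reported metrics can be related to one another with a short sketch. The Python below is not the authors' WEKA pipeline; the function names `f1_score` and `metrics_from_confusion` are hypothetical helpers introduced here for illustration. It assumes the standard binary-classification definitions (F1 as the harmonic mean of precision and recall) and uses only figures quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_confusion(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    # Class counts quoted in the abstract (not recomputed from the dataset).
    legitimate, phishing, total = 134_850, 100_945, 235_795
    assert legitimate + phishing == total
    print(f"phishing share: {phishing / total:.2%}")

    # Precision and recall quoted in the abstract, as fractions.
    print(f"F1 from reported precision/recall: {f1_score(0.9895, 0.9928):.2%}")
```

Under these assumptions, the reported precision (98.95%) and recall (99.28%) give an F1-score that rounds to 99.11%, matching the abstract, and the two class counts sum to the stated 235,795 instances.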

Published

2026-04-27

Section

Articles