Mining the Shadows: A Hybrid NLP Framework for Dark Web Cybercrime Investigation

Bilal Khan; Ans Riaz; Kausar Parveen

doi:10.54692/ijeci.2025.0901/246

Bilal Khan, Department of Computer Science, National College of Business Administration and Economics, Lahore, Pakistan
Ans Riaz, School of Physics, Engineering and Computer Science, University of Hertfordshire, UK
Kausar Parveen, Department of Computer Sciences, National College of Business Administration & Economics, Lahore, Pakistan

Keywords: digital forensics teams, Dark Web, malicious software, Natural Language Processing, cybercrime, BERT, RoBERTa, named entity recognition, IRB-aligned

Abstract

The Dark Web is a central hub of cybercrime, where threat actors plan campaigns, trade illegal materials, and sell malware. Traditional auditing of these environments is inefficient and does not scale: it is hampered by the sheer volume of content, linguistic diversity, and deliberate obfuscation. This article proposes a hybrid Natural Language Processing (NLP) framework for automated cybercrime investigation on Dark Web forums. Building on earlier research, the system combines transformer-based models such as BERT and RoBERTa with standard preprocessing steps. Custom components handle named-entity recognition (NER), topic modeling, sentiment and intent classification, and threat-keyword extraction. Stylometric profiling based on lexical and behavioral features supports author tracking across aliases. Experiments show high precision in entity identification, clustering of cybercriminal dialogue, and intent categorization, exceeding baseline models in both precision and recall. The system is further distinguished by a rule-aware ethical scraping protocol and an IRB-aligned data-processing layer. By converting raw, noisy forum text into structured threat intelligence, the framework enables scalable, real-time monitoring of cybercriminal ecosystems and provides actionable intelligence to cybersecurity researchers, digital forensics experts, commercial law-enforcement agencies, and other downstream consumers of threat data.
