A Novel Methodology for Classifying Wikipedia Articles: Insights into Digital Forensics and Content Integrity

  • Imran Khan, Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
  • Nadia Tabassum, Department of Computer Science, Virtual University of Pakistan, Lahore, Pakistan
  • Muhammad Asim, National College of Business Administration & Economics, Multan, Pakistan
  • Muhammad Hassan Ghulam Muhammad, Department of Computer Science, IMS Pak Aims, Lahore, Pakistan
  • Mushtaq Niazi, National College of Business Administration & Economics, Lahore, Pakistan
  • Umer Farooq, Department of Computer Science, Hamdard University, Karachi
  • Muhammad Farrukh Khan, Department of Computing, NASTP Institute of Information Technology, Lahore, Pakistan
Keywords: Wikipedia, Article Length (in words), Article Age (in days), Number of Edits, Article Viewer, Featured Articles, Good Articles, B-Class Articles, C-Class Articles

Abstract

Well-written articles shape readers' interaction with information, as top-ranked articles are more likely to be seen than those further down the ranks. We present a new approach to classifying Wikipedia articles by quality level, drawing on knowledge gained from expert assessments. The study aims to develop a robust framework for evaluating article quality, both to help ensure content integrity across the multipoint structure of the Internet and to provide input for digital forensics applications. In the proposed method, article details are gathered through the Wikipedia API, and a set of well-defined metrics is used to store and analyze this information. The methodology then explores the relationship between the independent variables (article metrics) and the dependent variable (quality level as rated by experts). Three machine learning algorithms, Random Forest (RF), J48, and Naive Bayes (NB), are then used to classify the articles. The resulting classifications are compared with the expert reviews to determine how well they reproduce the expert-assigned quality levels of Wikipedia articles. The empirical evidence illustrates the effectiveness of the proposed approach, with average accuracy greater than 70% for the J48 algorithm. The corresponding precision, recall, and F-measure values exceed 0.7, indicating strong model performance. Overall, these findings indicate that the method relies on sound criteria and classifies Wikipedia articles in accordance with experts' opinions, making it a dependable tool for quality assessment. In addition, the study underscores the importance of considering precision and recall together when assessing model quality, and demonstrates the method's usefulness for ensuring that content can be trusted and for supporting digital forensics.
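
The evaluation measures named in the abstract can be illustrated with a minimal sketch. It computes per-class precision, recall, and F-measure by comparing predicted quality classes against expert ratings. The class codes, the helper names (per_class_scores, macro_average), and the label lists are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
from collections import Counter

# Hypothetical class codes: Featured, Good, B-Class, C-Class.
CLASSES = ["FA", "GA", "B", "C"]


def per_class_scores(expert, predicted, classes=CLASSES):
    """Return {class: (precision, recall, f_measure)} for paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(expert, predicted):
        if truth == guess:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[guess] += 1      # predicted class was wrongly assigned
            fn[truth] += 1      # true class was missed
    scores = {}
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[c] = (p, r, f)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall, and F-measure over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Illustrative labels only (assumed, not taken from the paper).
expert = ["FA", "FA", "GA", "GA", "B", "B", "C", "C"]
predicted = ["FA", "GA", "GA", "GA", "B", "C", "C", "C"]

scores = per_class_scores(expert, predicted)
avg_precision, avg_recall, avg_f = macro_average(scores)
accuracy = sum(t == g for t, g in zip(expert, predicted)) / len(expert)
```

Reporting precision and recall side by side, rather than accuracy alone, shows whether a classifier over-assigns a class (low precision) or misses members of it (low recall), which is the combined focus the abstract emphasizes.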

 

Published
2024-12-17