Spam Email Detection with Continuous Bag-of-Words and Random Forest (Deteksi Email Spam dengan Continuous Bag-Of-Words dan Random Forest)


Michiavelly Rustam
Agung Brotokuncoro
Rusdianto Roestam


Spam email is a significant cyber threat: scammers use a range of tactics to trick recipients into divulging sensitive information or downloading harmful content. In June 2023, for instance, Indonesia recorded approximately 6.51 thousand spam attacks, which illustrates how widespread the problem is. These attacks frequently rely on deception, such as impersonation or false promises of rewards, and falling for them can lead to financial loss and other serious consequences. This research addresses the problem by classifying email content to detect phishing attempts. The proposed solution runs on a hosted runtime platform such as Google Colab and combines Continuous Bag of Words (CBOW) analysis with a Random Forest classifier. CBOW is selected for its effectiveness in capturing semantic relationships between words, which allows the model to extract meaningful features from email content. Random Forest is chosen for its ability to handle the imbalanced datasets commonly encountered in email classification, so that both spam and ham emails are fairly represented during model training. By combining the two techniques, we aim to develop a robust classification model that accurately distinguishes phishing (spam) from legitimate (ham) emails and thereby strengthens email security. We apply the approach to classify the SpamAssassin dataset into ham and spam categories, with an anticipated precision of 0.98.
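
To make the CBOW idea concrete, the following minimal sketch shows how context/target training pairs could be built from a tokenised email. It is an illustration only, not the authors' implementation: the function name `cbow_pairs`, the window size, and the sample text are assumptions introduced here.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Return (context_words, target_word) pairs for CBOW training.

    For each position, the context is up to `window` tokens on each side.
    A CBOW model learns word embeddings by predicting the target word
    from that surrounding context.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # a one-token text has no context to learn from
            pairs.append((context, target))
    return pairs


# Hypothetical, already lower-cased and tokenised email subject line.
email_tokens = "claim your free prize now".split()
for context, target in cbow_pairs(email_tokens, window=1):
    print(context, "->", target)
```

In practice the embedding training itself would normally be delegated to a library (for example, gensim's `Word2Vec` with `sg=0` selects CBOW), and the resulting per-email vectors would then be passed to a Random Forest classifier.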


How to Cite
Rustam, M., Brotokuncoro, A. and Roestam, R. (2024) “Deteksi Email Spam dengan Continuous Bag-Of-Words dan Random Forest”, Ranah Research : Journal of Multidisciplinary Research and Development, 6(4), pp. 758-765. doi: 10.38035/rrj.v6i4.873.

