SMS Spam Filtering with the DistilBERT Machine Learning Model
Keywords:
SMS Spam Detection, BERT, NLP, Transformer Models, Text Classification

Abstract
This work addresses SMS spam detection using transformer-based language models, with BERT as the representative architecture. Unsolicited messages pose serious threats to user privacy, network efficiency, and the integrity of conversations. Traditional rule-based models and shallow machine learning techniques often fail to capture the continuously evolving linguistic patterns of spam messages. We therefore exploit BERT's ability to model contextual relationships in natural language. A public Kaggle dataset of labelled spam and ham messages was used for training and testing: the data was preprocessed and fed into the model, which was fine-tuned in a binary classification setting. Performance was evaluated with the standard metrics of accuracy, precision, recall, and F1-score. The results show that BERT captures subtle language features and detects spam messages accurately, making it a viable option for real-time SMS filtering systems.
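For illustration only, the sketch below shows a minimal version of the fine-tuning pipeline the abstract describes, using the Hugging Face transformers Trainer API with distilbert-base-uncased (the model named in the title). The file name spam.csv, the column names label and text, and the hyperparameters (epochs, batch size, maximum sequence length) are assumptions for the example, not values reported in the paper.

    # Minimal fine-tuning sketch for binary SMS spam classification with DistilBERT.
    # File name, column names, and hyperparameters below are illustrative assumptions.
    import pandas as pd
    import torch
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from transformers import (DistilBertTokenizerFast, DistilBertForSequenceClassification,
                              Trainer, TrainingArguments)

    # Load the labelled messages and encode ham -> 0, spam -> 1.
    df = pd.read_csv("spam.csv", encoding="latin-1")[["label", "text"]]
    df["label"] = (df["label"] == "spam").astype(int)
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        df["text"].tolist(), df["label"].tolist(),
        test_size=0.2, stratify=df["label"], random_state=42)

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

    class SmsDataset(torch.utils.data.Dataset):
        """Wraps tokenized SMS messages and labels for the Trainer API."""
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    def compute_metrics(pred):
        # The standard metrics named in the abstract: accuracy, precision, recall, F1.
        preds = pred.predictions.argmax(-1)
        p, r, f1, _ = precision_recall_fscore_support(pred.label_ids, preds, average="binary")
        return {"accuracy": accuracy_score(pred.label_ids, preds),
                "precision": p, "recall": r, "f1": f1}

    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="spam-distilbert",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=SmsDataset(train_texts, train_labels),
        eval_dataset=SmsDataset(test_texts, test_labels),
        compute_metrics=compute_metrics)

    trainer.train()
    print(trainer.evaluate())

The held-out test split stands in for the evaluation stage described in the abstract; in practice the dataset's class imbalance (far more ham than spam) makes precision, recall, and F1 more informative than accuracy alone.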