SMS Spam Filtering with the DistilBERT Machine Learning Model

Authors

  • C. Naveenkumar, M.C.A Student, Department of MCA, KMMIPS, Tirupati (D.T), Andhra Pradesh, India
  • Dr. K. Venkataramana, Professor, Department of MCA, KMMIPS, Tirupati (D.T), Andhra Pradesh, India

Keywords:

SMS Spam Detection, BERT, NLP, Transformer Models, Text Classification

Abstract

This work addresses SMS spam detection using modern language technology, namely transformer models with BERT as their representative. Unsolicited spam messages pose serious threats to user privacy, network efficiency, and the integrity of conversations. Traditional rule-based filters and shallow machine learning techniques usually fail to capture the continuously evolving linguistic expressions of spam messages. We therefore leverage BERT, a state-of-the-art natural language model, to capture the contextual relationships within message text. A public Kaggle dataset of messages labelled as spam or ham was used for training and testing: the data was preprocessed and fed into the BERT model for fine-tuning in a binary classification setting. Model performance was evaluated with the standard metrics of accuracy, precision, recall, and F1 score. The results show that BERT captures subtle language features and detects spam messages accurately, making it a viable option for real-time SMS filtering systems.
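The abstract evaluates the fine-tuned classifier with accuracy, precision, recall, and F1 score. As a minimal sketch (not the authors' code; the label convention 1 = spam as the positive class is an assumption), these metrics can be computed directly from model predictions on a binary spam/ham test set:

```python
# Minimal sketch of the binary spam/ham evaluation metrics named in
# the abstract. Assumed label convention: 1 = spam (positive), 0 = ham.

def spam_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical example: 8 test messages, where the model flags one
# ham message as spam (false positive) and misses one spam (false negative).
truth = [1, 0, 0, 1, 0, 1, 0, 0]
preds = [1, 0, 1, 1, 0, 0, 0, 0]
metrics = spam_metrics(truth, preds)  # accuracy 0.75; P, R, F1 all 2/3
```

In spam filtering, precision and recall matter more than raw accuracy because the classes are imbalanced: high precision keeps legitimate (ham) messages out of the spam folder, while high recall keeps spam out of the inbox.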

Published

05-05-2025

Section

Research Articles