Enhanced Text Classification Using Random Forest: Comparative Analysis and Insights on Performance and Efficiency
Abstract
Text classification is a critical task in natural language processing (NLP), with applications ranging from spam detection to sentiment analysis and document categorisation. This research investigates the application of the Random Forest (RF) algorithm to text classification, comparing its performance against traditional classifiers such as Naive Bayes (NB), Support Vector Machines (SVM), and Logistic Regression (LR). Experiments were conducted on the 20 Newsgroups dataset, a benchmark text corpus characterised by high-dimensional, sparse data. The methodology involved rigorous preprocessing, including tokenisation, stopword removal, stemming, and TF-IDF vectorisation. Random Forest achieved the best overall performance, with an accuracy of 89.3% and an F1-score of 88.1%, surpassing SVM (87.6% accuracy, 87.2% F1-score), LR (86.3% accuracy, 86.0% F1-score), and NB (82.4% accuracy, 81.9% F1-score). RF also demonstrated competitive computational efficiency, with a training time of 2.0 s, making it faster than SVM (3.8 s) and comparable to LR (2.5 s). The results underscore the robustness of RF on high-dimensional text data, offering a balance between predictive performance and computational efficiency. The study also identified limitations, notably the dependence on TF-IDF for feature representation and the restriction of experiments to a single dataset. Future work will integrate RF with advanced text embeddings and extend the evaluation to diverse datasets. These findings establish RF as a reliable, interpretable, and efficient classifier for text classification tasks, with significant potential for further enhancement in modern NLP applications.
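For readers who wish to approximate the pipeline the abstract describes, the sketch below is a minimal, illustrative reconstruction using scikit-learn, which the paper does not name as its toolkit. The specific hyperparameters (100 trees, a 50,000-term TF-IDF vocabulary) and the macro-averaged F1 are assumptions made here for concreteness, and stemming is omitted for brevity; the pipeline loads 20 Newsgroups, applies TF-IDF vectorisation with stopword removal, and times each of the four classifiers.

```python
# Minimal sketch of the comparison described above (assumed toolkit: scikit-learn).
# Stemming is omitted for brevity; the paper's preprocessing also stems tokens.
import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Load the benchmark corpus; strip headers/footers/quotes so the classifiers
# see only the message bodies.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# TF-IDF vectorisation with built-in tokenisation and English stopword removal.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

classifiers = {
    "RF": RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42),
    "SVM": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
    "NB": MultinomialNB(),
}

for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X_train, train.target)
    elapsed = time.perf_counter() - start
    pred = clf.predict(X_test)
    acc = accuracy_score(test.target, pred)
    f1 = f1_score(test.target, pred, average="macro")
    print(f"{name}: accuracy={acc:.3f}, macro-F1={f1:.3f}, train time={elapsed:.1f}s")
```

Because absolute scores and timings depend on hardware, library versions, and hyperparameter choices, the printed figures should not be expected to match the 89.3% accuracy or 2.0 s training time reported above.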