A Scalable Real-Time Event Prediction System for Distributed Networks Using Online Random Forest and CluStream

R.Anil Kular
A Malla Reddy
K Samunnisa

Abstract

This paper presents an architecture for real-time event prediction in distributed networks that combines Online Random Forest (ORF) for incremental learning with CluStream for dynamic clustering. The system targets high-velocity, large-scale data streams and uses adaptive sliding windows with real-time data processing to maintain scalability, low latency, and accuracy. In a comparative analysis against traditional models, including Naive Bayes and Support Vector Machines, the proposed system achieves higher predictive accuracy (91.5%), precision (92%), and recall (88%), with an F1 score of 90%. CluStream manages evolving data streams dynamically and requires less clustering time than conventional methods such as K-Means. Latency, however, grows with stream size, from 120 ms for small streams (10 MB) to 850 ms for large streams (1000 MB), which indicates a need for further optimization at extreme scales. The system is suited to applications in network security, IoT monitoring, and large-scale real-time analytics. Its limitations include resource consumption and difficulty handling highly volatile or unstructured data. Future work may focus on reducing latency for larger data streams and improving adaptability to extreme concept drift. Overall, the research demonstrates a scalable, efficient, and adaptive approach to real-time event prediction in distributed environments.
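
The reported F1 score can be checked against the reported precision and recall, since F1 is the harmonic mean of the two. The sketch below is a minimal consistency check and not part of the authors' system; the function name `f1_score` and the variable names are illustrative assumptions, and only the 92% and 88% figures come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract (hypothetical variable names).
reported_precision = 0.92
reported_recall = 0.88

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.8996"
```

The result rounds to the 90% F1 score stated in the abstract, so the three reported figures are mutually consistent.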

How to Cite
R.Anil Kular, A Malla Reddy, and K Samunnisa, “A Scalable Real-Time Event Prediction System for Distributed Networks Using Online Random Forest and CluStream”, Int. J. Comput. Eng. Res. Trends, vol. 11, no. 6, pp. 43–56, Jun. 2024.
Section
Research Articles
