Survey on Big Data using Apache Hadoop and Spark
Abstract
Big data is growing rapidly in volume, variability, and velocity, which makes it difficult to capture, process, and analyze. Hadoop processes data with MapReduce, which consists of two phases: Map and Reduce. In contrast, Spark uses Resilient Distributed Datasets (RDDs) and a Directed Acyclic Graph (DAG) of operations to process large datasets. Both Hadoop and Spark can use the Hadoop Distributed File System (HDFS) to store data. This paper describes the architecture and operation of Hadoop and Spark, highlighting their differences and the challenges MapReduce faces when processing large datasets. It also explores how Spark runs on Hadoop YARN.
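The two MapReduce phases named in the abstract can be illustrated with a minimal word-count sketch. It is plain, single-process Python and does not use Hadoop or Spark APIs. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical and chosen only for illustration. The intermediate grouping step stands in for the shuffle that a real cluster would perform between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as a framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return (key, sum(values))


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the paper.
    sample = ["big data with hadoop", "big data with spark"]
    print(word_count(sample))
```

In this sketch every step runs in memory on one machine. On a cluster, map and reduce tasks would run in parallel on many nodes and read their input from a distributed store such as HDFS.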
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IJCERT Policy:
The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.
By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.