Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)
Abstract
As the World Wide Web grows day by day, the number of web pages worldwide has reached the billions. Search engines came into existence to make searching easier for users. A search engine's database is maintained by special software called a "crawler": software that traverses the web and downloads web pages. Broad search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Because the Web is a distributed, dynamic, and rapidly growing information resource, a crawler cannot download every page; in practice it crawls only a fraction of the Web. A crawler should therefore ensure that the fraction it crawls consists of the most relevant and most important pages, not just random ones.
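The idea of spending a limited crawl budget on the most important pages first can be illustrated with a best-first frontier. The following is only a minimal sketch, not the architecture proposed in this paper: the function `crawl_best_first`, the toy link graph `TOY_WEB`, and the importance scores in `TOY_SCORE` are all hypothetical, and a real crawler would fetch pages over HTTP and estimate importance itself.

```python
import heapq

def crawl_best_first(seeds, fetch_links, score, budget):
    """Crawl at most `budget` pages, always expanding the best-scored known URL.

    `fetch_links(url)` returns the outgoing links of a page and `score(url)`
    returns an importance estimate; both are assumed helpers supplied by the caller.
    """
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return crawled

# Hypothetical in-memory "web" and importance scores, for illustration only.
TOY_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
TOY_SCORE = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5, "e": 0.8}

order = crawl_best_first(
    seeds=["a"],
    fetch_links=lambda url: TOY_WEB.get(url, []),
    score=lambda url: TOY_SCORE.get(url, 0.0),
    budget=3,
)
print(order)
```

With a budget of three pages, the low-scored page `b` is never downloaded, which is the behaviour the abstract asks for: the crawled fraction is chosen by importance rather than by discovery order.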
The crawler is an important module of a search engine, and its quality directly affects the quality of search results. In this work, we propose improving the web crawler so that it crawls only relevant and important pages from the WWW, which reduces server overhead. The proposed architecture also optimizes the crawled data by removing the least used or never browsed pages. A crawler needs a large amount of memory or database space to store page content and related data. By not storing irrelevant and unimportant pages, and by removing pages that are never accessed, we save considerable storage space, which in turn speeds up queries to the database. We further propose using the Extended Weighted Page Rank based on Visits of Links (VOL) algorithm to sort the search results. This reduces the search space for users by placing the most visited pages, and the pages on which users spend the most time, at the top of the results list.
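To make the ranking idea concrete, the sketch below computes a PageRank-style score in which each page passes its rank along its outgoing links in proportion to how often users actually followed each link. It is a deliberately simplified illustration and an assumption on our part, not the Extended Weighted Page Rank formula itself: the abstract does not state that formula, and the time-spent component is not modelled here. The function name `visit_weighted_rank`, the damping value, and the click counts are hypothetical.

```python
def visit_weighted_rank(link_visits, damping=0.85, iterations=50):
    """Rank pages from click counts on links.

    `link_visits` maps (source, target) to the number of recorded visits of
    that link. Simplified stand-in for a visits-of-links ranking (assumption).
    """
    pages = set()
    out_total = {}  # total visits over all outgoing links of each source page
    for (src, dst), visits in link_visits.items():
        pages.update((src, dst))
        out_total[src] = out_total.get(src, 0) + visits

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {page: 1.0 - damping for page in pages}
        for (src, dst), visits in link_visits.items():
            if out_total[src] > 0:
                # Share of src's rank sent to dst equals the share of clicks it got.
                new_rank[dst] += damping * rank[src] * visits / out_total[src]
        rank = new_rank
    return rank

# Hypothetical click log: most users leaving "home" go to "docs".
visits = {
    ("home", "docs"): 90,
    ("home", "blog"): 10,
    ("docs", "home"): 5,
    ("blog", "home"): 5,
}
ranks = visit_weighted_rank(visits)
ordered = sorted(ranks, key=ranks.get, reverse=True)
print(ordered)
```

Sorting search results by such a score places frequently visited pages first. In this toy data, `docs` and `blog` have identical link structure, so `docs` outranks `blog` only because of its click counts.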
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
IJCERT Policy:
The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.
By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.