Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)

Isha Mahajan; Harjinder Kaur; Dr. Darshan Kumar

Published: Jun 30, 2017

Keywords:

Web Crawler, Extended Weighted Page Rank based on Visits of Links, Weighted Page Rank, Page Rank, Page Rank based on Visits of Links, Search Engine, Crawling, Bot, Information Retrieval Engine, Page Reading Time, User Attention Time, World Wide Web, Inlinks, Outlinks, Web Information Retrieval, Online Search.

Isha Mahajan

Department of Computer Science & Engineering SSIET, Dinanagar - 143531, Distt. Gurdaspur, Punjab (India)

Harjinder Kaur

Department of Computer Science & Engineering SSIET, Dinanagar - 143531, Distt. Gurdaspur, Punjab (India)

Dr. Darshan Kumar

Department of Computer Science & Engineering SSIET, Dinanagar - 143531, Distt. Gurdaspur, Punjab (India)

Abstract

As the World Wide Web grows day by day, the number of web pages worldwide has reached into the billions. Search engines came into existence to make searching easier for users. A search engine's database is maintained by special software called a "crawler": software that traverses the web and downloads web pages. Broad search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Because the Web is a distributed, dynamic, and rapidly growing information resource, a crawler cannot download every page; it can crawl only a fraction of the Web. A crawler should therefore ensure that the fraction it crawls consists of the most relevant and important pages, not just random ones.
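The idea of spending a limited crawl budget on the most important pages first can be illustrated with a priority-ordered frontier. The sketch below is only an illustration under stated assumptions: the link graph is a small in-memory dictionary rather than the live Web, the importance scores are made-up numbers supplied in advance, and the names `crawl_with_budget`, `toy_graph`, and `toy_importance` are hypothetical. It is not the architecture proposed in the paper.

```python
import heapq


def crawl_with_budget(link_graph, importance, seeds, budget):
    """Visit at most `budget` pages, always taking the most important known page next.

    link_graph: dict mapping a URL to the list of URLs it links to (assumed input).
    importance: dict mapping a URL to an assumed importance score.
    """
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-importance.get(url, 0.0), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)  # a real crawler would download the page here
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-importance.get(target, 0.0), target))
    return crawled


# Toy data, invented for illustration only.
toy_graph = {"a": ["b", "c", "d"], "b": ["e"], "c": [], "d": [], "e": []}
toy_importance = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5, "e": 0.8}
top_pages = crawl_with_budget(toy_graph, toy_importance, ["a"], budget=3)
```

With a budget of three pages, the sketch crawls the seed and then the two highest-scored pages it has discovered, leaving the low-scored page and everything reachable only through it uncrawled.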

The crawler is an important module of a search engine, and its quality directly affects the quality of search results. In our work, we propose to improve crawling so that the crawler fetches only relevant and important pages from the WWW, which reduces server overhead. Our proposed architecture also optimizes the crawled data by removing the least used or never browsed pages. A crawler needs a large amount of memory or database space to store page content. By not storing irrelevant and unimportant pages, and by removing pages that are never accessed, we save considerable storage space, which in turn speeds up queries to the database. We further propose to sort search results with the Extended Weighted Page Rank based on Visits of Links algorithm, which reduces the search space for users by placing the most visited pages, and the pages on which users spend the most time, at the top of the results list.
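The two ideas in this paragraph, dropping pages that are never accessed and ordering results by visits and reading time, can be sketched in a few lines. The abstract does not give the Extended Weighted Page Rank based on Visits of Links formula, so the linear score below is a hypothetical stand-in and not the authors' algorithm; the names `PageStats` and `prune_and_rank`, the weights, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    visits: int          # number of recorded user visits (assumed to be logged)
    seconds_read: float  # total time users spent reading the page (assumed to be logged)


def prune_and_rank(pages, visit_weight=1.0, time_weight=0.01):
    """Drop never-visited pages, then sort the rest by a simple usage score.

    The score is an illustrative weighted sum, not the paper's ranking formula.
    """
    kept = [p for p in pages if p.visits > 0]

    def score(p):
        return visit_weight * p.visits + time_weight * p.seconds_read

    return sorted(kept, key=score, reverse=True)


# Sample data, invented for illustration only.
sample = [
    PageStats("example.org/a", visits=10, seconds_read=300.0),
    PageStats("example.org/b", visits=0, seconds_read=0.0),
    PageStats("example.org/c", visits=4, seconds_read=2000.0),
]
ranked_urls = [p.url for p in prune_and_rank(sample)]
```

In this toy data the never-visited page is removed, and the page with fewer visits but much longer reading time ranks first, which shows why the choice of weights matters for any score that combines the two signals.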

How to Cite

[1]

Isha Mahajan, Harjinder Kaur, and Dr. Darshan Kumar, “Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)”, Int. J. Comput. Eng. Res. Trends, vol. 4, no. 6, pp. 202–230, Jun. 2017.

Issue

Vol. 4 No. 6 (2017): June (2017) Issue

Section

Research Articles

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

IJCERT Policy:

The published work presented in this paper is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This means that the content of this paper can be shared, copied, and redistributed in any medium or format, as long as the original author is properly attributed. Additionally, any derivative works based on this paper must also be licensed under the same terms. This licensing agreement allows for broad dissemination and use of the work while maintaining the author's rights and recognition.

By submitting this paper to IJCERT, the author(s) agree to these licensing terms and confirm that the work is original and does not infringe on any third-party copyright or intellectual property rights.
