Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)

Isha Mahajan
Harjinder Kaur
Dr. Darshan Kumar

Abstract

The World Wide Web grows larger every day, and the number of web pages worldwide now runs into the billions. Search engines came into existence to make searching this volume of content easier for users. A search engine's database is built and maintained by special software called a crawler, which traverses the web and downloads web pages. Broad search engines, as well as many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Because the Web is a distributed, dynamic, and rapidly growing information resource, it is almost impossible for a crawler to download every page; in practice, crawlers cover only a fraction of the World Wide Web. A crawler should therefore ensure that the fraction it does crawl consists of the most relevant and most important pages, not just random ones.
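The idea of crawling only the most important fraction of pages can be illustrated with a priority-ordered crawl frontier. The sketch below is a minimal illustration, not the architecture proposed in this article: the function name `crawl_by_priority`, the in-memory `links` graph, and the precomputed `importance` scores are all assumptions made for the example, and no network access is performed.

```python
import heapq

def crawl_by_priority(links, importance, seeds, budget):
    """Visit at most `budget` pages, always taking the highest-scored known URL next.

    links:      dict mapping a URL to the URLs it links to (hypothetical toy graph)
    importance: dict mapping a URL to an assumed importance score
    """
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-importance.get(u, 0.0), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-importance.get(nxt, 0.0), nxt))
    return crawled

# Toy data: page "b" is low-scored and is skipped once the budget runs out.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
importance = {"a": 1.0, "b": 0.2, "c": 0.9, "d": 0.5}
order = crawl_by_priority(links, importance, ["a"], budget=3)
```

With a budget of three pages, the low-scored page is the one left uncrawled. In a real crawler the scores would have to be estimated before a page is downloaded, which is the hard part this sketch leaves out.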


The crawler is an important module of a search engine, and its quality directly affects the quality of search results. In our work, we propose to improve the crawling process so that only relevant and important pages are crawled from the WWW, which will reduce server overhead. With our proposed architecture, we will also optimize the crawled data by removing the least-used and never-browsed pages. A crawler needs a large amount of memory or database space to store page content and related data. By not storing irrelevant and unimportant pages, and by removing pages that are never accessed, we save a substantial amount of storage, which in turn speeds up queries to the database. We further propose to sort the search results with the Extended Weighted Page Rank algorithm based on visits of links (VOL). Placing the most-visited pages, and the pages on which users spend the most time, at the top of the results list reduces the search space for the user.
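To make the ranking idea concrete, the following is a minimal sketch of a visit-weighted PageRank iteration. It assumes one formulation found in the weighted PageRank literature, in which the rank a page v passes to a page u is scaled by the share of clicks on the link v -> u and by an in-link popularity weight. This abstract does not state the extended formula, so the function name, the damping factor, and the data layout are assumptions, and the time-on-page component is deliberately left out.

```python
def vol_weighted_pagerank(visits, d=0.85, iterations=50):
    """Sketch of PageRank weighted by visits of links (VOL).

    visits[v][u] = number of recorded clicks on the link v -> u (assumed input).
    Assumed update rule:
      rank(u) = (1 - d) + d * sum_v rank(v) * W_in(v, u) * clicks(v, u) / total_clicks(v)
    """
    pages = set(visits)
    for targets in visits.values():
        pages.update(targets)

    # Number of in-links per page, used for the in-link popularity weight.
    in_count = {p: 0 for p in pages}
    for targets in visits.values():
        for u in targets:
            in_count[u] += 1

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for v, targets in visits.items():
            total_visits = sum(targets.values())
            total_in = sum(in_count[u] for u in targets)
            if total_visits == 0 or total_in == 0:
                continue  # no click data for this page: it passes on no rank
            for u, clicks in targets.items():
                w_in = in_count[u] / total_in
                new_rank[u] += d * rank[v] * w_in * clicks / total_visits
        rank = new_rank
    return rank

# Toy click log: users leaving "home" overwhelmingly follow the link to "docs".
visits = {"home": {"docs": 90, "blog": 10}, "docs": {"home": 5}, "blog": {"home": 5}}
rank = vol_weighted_pagerank(visits)
results = sorted(rank, key=rank.get, reverse=True)
```

In this toy log, "docs" and "blog" have the same link structure, so the difference in their scores comes only from the click counts. An extension along the lines described above would presumably add a further factor for reading time, but its form is not given here.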

Article Details

How to Cite
[1]
Isha Mahajan, Harjinder Kaur, and Dr. Darshan Kumar, “Enhancement in Crawling and Searching (Using Extended Weighted Page Rank Algorithm based on VOL)”, Int. J. Comput. Eng. Res. Trends, vol. 4, no. 6, pp. 202–230, Jun. 2017.
Section
Research Articles
