TY - JOUR
T1 - Novel approaches to crawling important pages early
AU - Alam, Md Hijbul
AU - Ha, JongWoo W.
AU - Lee, Sang-Geun
N1 - Funding Information:
This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2012M3C4A7033344) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2011-0010325).
PY - 2012/12
Y1 - 2012/12
N2 - Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering by prioritizing important pages with the well-known PageRank as the importance metric. In order to score URLs, the proposed algorithms utilize various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment using publicly available data sets to examine the effect of each feature on crawl ordering and evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces computational overhead by a factor as great as three in running time while improving effectiveness by 5% in cumulative PageRank.
KW - Crawl ordering
KW - Fractional PageRank
KW - PageRank
KW - Web crawler
UR - http://www.scopus.com/inward/record.url?scp=84869092092&partnerID=8YFLogxK
U2 - 10.1007/s10115-012-0535-4
DO - 10.1007/s10115-012-0535-4
M3 - Article
AN - SCOPUS:84869092092
SN - 0219-1377
VL - 33
SP - 707
EP - 734
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 3
ER -