TY - JOUR
T1 - Classification of web robots
T2 - An empirical study based on over one billion requests
AU - Lee, Junsup
AU - Cha, Sungdeok
AU - Lee, Dongkun
AU - Lee, Hyungkyu
N1 - Funding Information:
The authors would like to thank Microsoft Corporation, MSRA UR in particular, for its generous support, without which the research reported in this paper could not have been performed. MSRA provided us with raw data as well as a research gift. The authors also acknowledge support provided by the following organizations: Defense Acquisition Program Administration and Agency for Defense Development; Korea SW Industry Promotion Agency through the Software Engineering Technologies Development and Experts Education program; MKE (Ministry of Knowledge Economy) and IITA, Korea through the ITRC (Information Technology Research Centre) support program IITA-2008-(C1090-0801-0032); MKE/MCST/IITA program 2008-S-025-02 at ETRI; and a research grant from Korea University.
PY - 2009/11
Y1 - 2009/11
N2 - Many studies on the detection and classification of web robots have focused mostly on text crawlers, and their empirical experiments used relatively small data sets collected at universities. In this paper, we analyzed more than one billion requests made to www.microsoft.com within 24 h. Web logs were anonymized to eliminate potential privacy concerns while preserving essential characteristics (e.g., frequency and queries). We developed effective characterization metrics, based on workload characteristics and resource types, for detecting and classifying various web robots, including text crawlers, link checkers, and icon crawlers. As expected, web robot behavior was clearly different from that of typical interactive users, and different types of web robots also exhibited different characteristics. However, comparison of web robots of the same type, text crawlers in particular, revealed different characteristics, thereby enabling characterization with a reasonably high confidence level. We divided the feature metrics into five groups, and the effectiveness of each group in classification is shown in a polar diagram, in decreasing order of effectiveness in the clockwise direction. One can use the findings to determine the likely identity of unknown web robots, and organizations can develop appropriate measures to deal with them. Our analysis is based on recent web log data collected at one of the best-known sites offering a truly global service.
AB - Many studies on the detection and classification of web robots have focused mostly on text crawlers, and their empirical experiments used relatively small data sets collected at universities. In this paper, we analyzed more than one billion requests made to www.microsoft.com within 24 h. Web logs were anonymized to eliminate potential privacy concerns while preserving essential characteristics (e.g., frequency and queries). We developed effective characterization metrics, based on workload characteristics and resource types, for detecting and classifying various web robots, including text crawlers, link checkers, and icon crawlers. As expected, web robot behavior was clearly different from that of typical interactive users, and different types of web robots also exhibited different characteristics. However, comparison of web robots of the same type, text crawlers in particular, revealed different characteristics, thereby enabling characterization with a reasonably high confidence level. We divided the feature metrics into five groups, and the effectiveness of each group in classification is shown in a polar diagram, in decreasing order of effectiveness in the clockwise direction. One can use the findings to determine the likely identity of unknown web robots, and organizations can develop appropriate measures to deal with them. Our analysis is based on recent web log data collected at one of the best-known sites offering a truly global service.
KW - Web robot characterization
KW - Web robot classification
KW - Web robot detection
KW - Web security
KW - Web usage mining
UR - http://www.scopus.com/inward/record.url?scp=71849091131&partnerID=8YFLogxK
U2 - 10.1016/j.cose.2009.05.004
DO - 10.1016/j.cose.2009.05.004
M3 - Article
AN - SCOPUS:71849091131
SN - 0167-4048
VL - 28
SP - 795
EP - 802
JO - Computers and Security
JF - Computers and Security
IS - 8
ER -