Distinguishing humans from robots in web search logs preliminary results using query rates and intervals Omer Duskin Dror G. Feitelson School of Computer.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Recommender Systems & Collaborative Filtering
Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)
Chapter 5: Introduction to Information Retrieval
Brief introduction on Logistic Regression
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Personalized Query Classification Bin Cao, Qiang Yang, Derek Hao Hu, et al. Computer Science and Engineering Hong Kong UST.
G. Alonso, D. Kossmann Systems Group
WebMiningResearch ASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007.
PROBLEM BEING ATTEMPTED Privacy -Enhancing Personalized Web Search Based on:  User's Existing Private Data Browsing History s Recent Documents 
Amanda Spink : Analysis of Web Searching and Retrieval Larry Reeve INFO861 - Topics in Information Science Dr. McCain - Winter 2004.
Investigation of Web Query Refinement via Topic Analysis and Learning with Personalization Department of Systems Engineering & Engineering Management The.
Web Mining Research: A Survey
WebMiningResearchASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007 Revised.
COMP 630L Paper Presentation Javy Hoi Ying Lau. Selected Paper “A Large Scale Evaluation and Analysis of Personalized Search Strategies” By Zhicheng Dou,
Topic-Sensitive PageRank Taher H. Haveliwala. PageRank Importance is propagated A global ranking vector is pre-computed.
IEEE 7th Annual Workshop on Workload Characterization The USAR Characterization Model Adriano Pereira, Gustavo Gorgulho, Leonardo Silva, Wagner Meira Jr.,
University of Kansas Department of Electrical Engineering and Computer Science Dr. Susan Gauch April 2005 I T T C Dr. Susan Gauch Personalized Search Based.
CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION TRIVIKRAM BHAT UNIVERSITY OF TEXAS AT ARLINGTON DATA MINING CSE6362 BASED ON PAPER.
Personalized Ontologies for Web Search and Caching Susan Gauch Information and Telecommunications Technology Center Electrical Engineering and Computer.
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
The 2nd International Conference of e-Learning and Distance Education, 21 to 23 February 2011, Riyadh, Saudi Arabia Prof. Dr. Torky Sultan Faculty of Computers.
Chapter 1: Introduction to Statistics
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
Distributed Networks & Systems Lab. Introduction Collaborative filtering Characteristics and challenges Memory-based CF Model-based CF Hybrid CF Recent.
MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS Xiaodong Shi and Christopher C. Yang Definitions: Query Record: A query record represents the submission.
Gender based analysis Udayan Kumar Computer and Information Science and Engineering (CISE) Department, University Of Florida, Gainesville, FL.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
PERSONALIZED SEARCH Ram Nithin Baalay. Personalized Search? Search Engine: A Vital Need Next level of Intelligent Information Retrieval. Retrieval of.
Using Identity Credential Usage Logs to Detect Anomalous Service Accesses Daisuke Mashima Dr. Mustaque Ahamad College of Computing Georgia Institute of.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Web Usage Mining for Semantic Web Personalization جینی شیره شعاعی زهرا.
Collaborative Filtering versus Personal Log based Filtering: Experimental Comparison for Hotel Room Selection Ryosuke Saga and Hiroshi Tsuji Osaka Prefecture.
1 Chapter Two: Sampling Methods §know the reasons of sampling §use the table of random numbers §perform Simple Random, Systematic, Stratified, Cluster,
Personalized Course Navigation Based on Grey Relational Analysis Han-Ming Lee, Chi-Chun Huang, Tzu- Ting Kao (Dept. of Computer Science and Information.
A Statistical Comparison of Tag and Query Logs Mark J. Carman, Robert Gwadera, Fabio Crestani, and Mark Baillie SIGIR 2009 June 4, 2010 Hyunwoo Kim.
I can be You: Questioning the use of Keystroke Dynamics as Biometrics —Paper by Tey Chee Meng, Payas Gupta, Debin Gao Presented by: Kai Li Department of.
ICCS 2009 IDB Workshop, 18 th February 2010, Madrid 1 Training Workshop on the ICCS 2009 database Weighting and Variance Estimation picture.
Measures of Reliability in Sports Medicine and Science Will G. Hopkins Sports Medicine 30(4): 1-25, 2000.
Jiafeng Guo(ICT) Xueqi Cheng(ICT) Hua-Wei Shen(ICT) Gu Xu (MSRA) Speaker: Rui-Rui Li Supervisor: Prof. Ben Kao.
Retroactive Answering of Search Queries Beverly Yang Glen Jeh.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Social Tag Prediction Paul Heymann, Daniel Ramage, and Hector Garcia- Molina Stanford University SIGIR 2008.
Pin-Yun Tarng / An Analysis of WoW Players’ Game Hours Network and Systems Laboratory nslab.ee.ntu.edu.tw IEEE/IFIP DSN 2008 Network and Systems Laboratory.
 Who Uses Web Search for What? And How?. Contribution  Combine behavioral observation and demographic features of users  Provide important insight.
Post-Ranking query suggestion by diversifying search Chao Wang.
Predicting the Location and Time of Mobile Phone Users by Using Sequential Pattern Mining Techniques Mert Özer, Ilkcan Keles, Ismail Hakki Toroslu, Pinar.
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Predicting User Interests from Contextual Information R. W. White, P. Bailey, L. Chen Microsoft (SIGIR 2009) Presenter : Jae-won Lee.
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Quality Is in the Eye of the Beholder: Meeting Users ’ Requirements for Internet Quality of Service Anna Bouch, Allan Kuchinsky, Nina Bhatti HP Labs Technical.
An Overview of Statistics Section 1.1 After you see the slides for each section, do the Try It Yourself problems in your text for that section to see if.
Using a model of Social Dynamics to Predict Popularity of News Kristina Lerman, Tad Hogg USC Information Sciences Institute, Institute for Molecular Manufacturing.
Data Mining – Introduction (contd…) Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
SEMINAR ON INTERNET SEARCHING PRESENTED BY:- AVIPSA PUROHIT REGD NO GUIDED BY:- Lect. ANANYA MISHRA.
Recommender Systems & Collaborative Filtering
Evaluating a Real-time Anomaly-based IDS
User Interface HEP Summit, DESY, May 2008
Detecting Online Commercial Intention (OCI)
A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence Yue Ming NJIT#:
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Web Mining Research: A Survey
WSExpress: A QoS-Aware Search Engine for Web Services
Inductive Clustering: A technique for clustering search results Hieu Khac Le Department of Computer Science - University of Illinois at Urbana-Champaign.
Journal of Web Semantics 55 (2019)
Presentation transcript:

Distinguishing humans from robots in web search logs preliminary results using query rates and intervals Omer Duskin Dror G. Feitelson School of Computer Science and Engineering The Hebrew University of Jerusalem, Israel Web Search and Web Data Mining 2009

Introduction The workload on web search engines is actually multiclass, being derived from the activities of both human users and automated robots. It is important to distinguish between these two classes in order to reliably characterize human web search behavior, and to study the effect of robot activity.

EXAMPLE The same user issued the same query twice within less than a second. (Robots?) Another example is “users” that are actually a meta-search engine, and forward queries on behalf of many real users. (Human?)

Possible Solution Software agents use Boolean operators extensively while human users do not. Simply use a threshold of 100 queries to distinguish humans from robots –a user who submitted less than 100 queries is assumed to be a human, and one who submitted more is assumed to be a robot.

Goal 1.We want to filter out all non-human activity, in order to enable a more reliable characterization of human search behavior. 2.We want to catalog and describe the different robot profiles that are encountered in real systems.

Introduction The dimensions that we intend to investigate include the following: 1.The average rate or number of queries submitted: –We prefer rate over an absolute number to accommodate logs of different durations, but in this work we use the number.

Introduction 2.The minimal interval between successive queries: –Humans are expected to require tens of seconds or more to submit new queries. 3. The rate at which users type –with humans limited to around 200 characters per minute.

Introduction 4.The duration of sessions of continuous activity –If a user is active continuously for say more than 10 hours, we’ll assume it is a robot. 5.Correlation with time of day –which humans expected to be much more active during daytime. 6.The regularity of the submitted queries –Users who submit the same query thousands of times, or submit queries at precisely measured intervals, are assumed to be robots.

Introduction In this paper we take initial steps to characterize the first two items and the interaction between them. Specifically, we set thresholds on one attribute (say the number of submitted queries), and investigate the differences between the resulting groups of users in terms of the distribution of the other attribute (say the minimal interval between successive queries).

Data Source

Classification by Number of Queries

The intervals between repetitions of the same query are different than between different queries for humans, but the two distributions are very similar for robots.

Classification by Number of Queries This is quite obviously the result of robots that indeed submit the same queries repeatedly at 5 minute. The precisely repetitive behavior of these robots provides testimony that they are indeed robots and not humans.

Classification by Minimal Interval Between Queries

While the distributions have significant overlap, it is clear that the distribution for those users classified as humans has much more weight at low values.

Effect of parameters

we do see that for users with very high number of queries the intervals tend to be extremely low. For users with high intervals there tend to be only few queries.

Effect of parameters For humans, this indicates that practically all users whose minimal interval is longer than 25 seconds also submit fewer than 10 queries. For robot candidates, the vast majority of users who submit more than 10 queries have non-zero intervals, and a slight majority of users with zero intervals only submit few queries.

CONCLUSIONS Our results indicate that it may be impossible to find a simple threshold that can be used to perform a reliable classification. We also suggest using two thresholds to improve the confidence in the results.

Future work More systematic study of the different threshold values and their effect, considering using different thresholds for queries and for clicks, and integration of the additional user behavior attributes.

Thank You Q & A