Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.

Slides:



Advertisements
Similar presentations
Generation of Multimedia TV News Contents for WWW Hsin Chia Fu, Yeong Yuh Xu, and Cheng Lung Tseng Department of computer science, National Chiao-Tung.
Advertisements

Query Classification Using Asymmetrical Learning Zheng Zhu Birkbeck College, University of London.
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
CMo: When Less Is More Yevgen Borodin Jalal Mahmud I.V. Ramakrishnan Context-Directed Browsing for Mobiles.
Optimizing search engines using clickthrough data
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen SIMS, UC Berkeley Susan Dumais Adaptive Systems & Interactions Microsoft.
Query Chains: Learning to Rank from Implicit Feedback Paper Authors: Filip Radlinski Thorsten Joachims Presented By: Steven Carr.
Ao-Jan Su † Y. Charlie Hu ‡ Aleksandar Kuzmanovic † Cheng-Kok Koh ‡ † Northwestern University ‡ Purdue University How to Improve Your Google Ranking: Myths.
22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop L EHIGH U NIVERSITY.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
A Quality Focused Crawler for Health Information Tim Tang.
Search Engines and Information Retrieval
H YPERLINKING DIGITAL LIBRARIES ON THE WEB Juan Camilo Zapata ITEC – 810 Supervisor Robert Dale 1.
Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
Mobile Web Search Personalization Kapil Goenka. Outline Introduction & Background Methodology Evaluation Future Work Conclusion.
Searching The Web Search Engines are computer programs (variously called robots, crawlers, spiders, worms) that automatically visit Web sites and, starting.
WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.
Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings Baoning Wu and Brian D. Davison Lehigh University Symposium on Applied.
SIEVE—Search Images Effectively through Visual Elimination Ying Liu, Dengsheng Zhang and Guojun Lu Gippsland School of Info Tech,
Improving web image search results using query-relative classifiers Josip Krapacy Moray Allanyy Jakob Verbeeky Fr´ed´eric Jurieyy.
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
1 ITGS - introduction A computer may have: a direct connection to a net (cable); or remote access (modem). Connect network to other network through: cables.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
Avalanche Internet Data Management System. Presentation plan 1. The problem to be solved 2. Description of the software needed 3. The solution 4. Avalanche.
Topics and Transitions: Investigation of User Search Behavior Xuehua Shen, Susan Dumais, Eric Horvitz.
1 Searching through the Internet Dr. Eslam Al Maghayreh Computer Science Department Yarmouk University.
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen, CS Division, UC Berkeley Susan Dumais, Microsoft Research ACM:CHI April.
11 CANTINA: A Content- Based Approach to Detecting Phishing Web Sites Reporter: Gia-Nan Gao Advisor: Chin-Laung Lei 2010/6/7.
Web Categorization Crawler Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Design & Architecture Dec
1 Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing Seung-Taek Park and David M. Pennock (ACM SIGKDD 2007)
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Satoshi Oyama Takashi Kokubo Toru lshida 國立雲林科技大學 National Yunlin.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
11 A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval Reporter: 林佳宜 /10/17.
Giorgos Giannopoulos (IMIS/”Athena” R.C and NTU Athens, Greece) Theodore Dalamagas (IMIS/”Athena” R.C., Greece) Timos Sellis (IMIS/”Athena” R.C and NTU.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
Improving Cloaking Detection Using Search Query Popularity and Monetizability Kumar Chellapilla and David M Chickering Live Labs, Microsoft.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
인지구조기반 마이닝 소프트컴퓨팅 연구실 박사 2 학기 박 한 샘 2006 지식기반시스템 응용.
Information Retrieval Effectiveness of Folksonomies on the World Wide Web P. Jason Morrison.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Analysis of Topic Dynamics in Web Search Xuehua Shen (University of Illinois) Susan Dumais (Microsoft Research) Eric Horvitz (Microsoft Research) WWW 2005.
How Useful are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, Jose San.
ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.
Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Chapter 1 Getting Listed. Objectives Understand how search engines work Use various strategies of getting listed in search engines Register with search.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
CP3024 Lecture 12 Search Engines. What is the main WWW problem?  With an estimated 800 million web pages finding the one you want is difficult!
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
By Pamela Drake SEARCH ENGINE OPTIMIZATION. WHAT IS SEO? Search engine optimization (SEO) is the process of affecting the visibility of a website or a.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Setting up a search engine KS 2 Search: appreciate how results are selected.
Predicting User Interests from Contextual Information R. W. White, P. Bailey, L. Chen Microsoft (SIGIR 2009) Presenter : Jae-won Lee.
Don’t Follow me : Spam Detection in Twitter January 12, 2011 In-seok An SNU Internet Database Lab. Alex Hai Wang The Pensylvania State University International.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Unveiling Zeus Automated Classification of Malware Samples Abedelaziz Mohaisen Omar Alrawi Verisign Inc, VA, USA Verisign Labs, VA, USA
Experience Report: System Log Analysis for Anomaly Detection
BotCatch: A Behavior and Signature Correlated Bot Detection Approach
Data Mining Chapter 6 Search Engines
Multimedia Information Retrieval
Presentation transcript:

Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006

Outline Motivation Proposed Solution Evaluation Conclusion

How search engine works Crawler downloads pages from the web. Indexer puts the content of the downloaded pages into index. For a given query, a relevance score of the query and each page that contains the query is calculated. Response list is generated based on the relevance scores.

Motivation Cloaking occurs when, for a given URL, different content is sent to browsers versus that sent to search engine crawlers. Some cloaking behavior is acceptable. Semantic cloaking (malicious cloaking) is the type of cloaking with the effect of deceiving search engines’ ranking algorithms.

Semantic cloaking example: keywords only sent to crawler game info, reviews, game reviews, previews, game previews, interviews, features, articles, feature articles, game developers, developers, developer diaries, strategy guides, game strategy, screenshots, screen shots, game screenshots, game screen shots, screens, forums, message boards, game forums, cheats, game cheats, cheat codes, playstation, playstation, dreamcast, Xbox, GameCube, game cube, gba, game, advance, software, game software, gaming software, files, game files, demos, game demos, play games, play games online, game release dates, Fargo, Daily Victim, Dork Tower, classics games, rpg, ………..

Task To build an automated system to detect semantic cloaking  based on the several copies of a same URL from both browsers’ and crawlers’ perspectives

How to collect data: UserAgent Browser:  Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) Crawler:  Googlebot/2.1 (+ B1C1B2C2 time

Outline Motivation Proposed Solution Evaluation Conclusion

Architecture Two copies B1 and C1 of each page Filtering Step Heuristic Rule Candidates from the first step Two more copies B2 and C2 for each candidate Classification Step Classifier Cloaked pages

Filtering Step To eliminate pages that do not employ semantic cloaking. Heuristic rules are used. For example, a rule might be:  to mark any page as long as the copy sent to the crawler contains a number of dictionary terms that don’t exist in the copy sent to the browser.

Classification Step A classifier is used.  E.g., Support Vector Machines, decision trees Operating on features including those from  Individual copies  Comparison of corresponding copies.

Features from individual copies Content-based  Number of terms in the page  Number of terms in the title field  Whether frame tag exists  …… Link-based  Number of links in the page  Number of links to a different site  Ratio of number of absolute links to the number of relative links.  ……

Features for corresponding copies Whether the number of terms in the keyword field of C1 is bigger than the one of B1 Whether the number of links in C2 is bigger than the one in B2 Number of common terms in C1 and B1 Number of links appearing only in B2, not in C2 ……

Building the classifier Joachims‘ SVM light is used. 162 features extracted for each URL. Data set:  47,170 unique pages (top 200 responses for popular queries).  We manually labeled 1,285 URLs, among which 539 are positive (semantic cloaking) and 746 are negative.

Training the classifier 60% of positive and 60% of negative examples are randomly selected for training and the rest are used for testing. Performance (average of five runs)  Accuracy: 91.3%  Precision: 93%  Recall: 85%

Discriminative features Whether the number of terms in the keyword field of the HTTP response header for C1 is bigger than the one for B1 Whether the number of unique terms in C1 is bigger than the one in B1 Whether C1 has the same number of relative links as B1 ………..

Outline Motivation Proposed Solution Evaluation Conclusion

Detecting semantic cloaking We used pages listed in dmoz Open Directory Project to demonstrate the value of our two-step architecture of detecting semantic cloaking. ODP 2004 gives us 4.3M URLs  Two copies of each of these URLs are downloaded for the filtering step.

Filtering step Rule: if the copy sent to crawler has more than three unique terms that do not exist in the copy sent to browser, or vice versa, the URL will be marked as a candidate. The filtering step marked 364,993 pages (4.3M pages in total) as candidates. All semantic cloaking of significance is marked.

Classification results For each of these 364,993 pages, two more copies are downloaded. The classifier (trained on the earlier data set) marked 46,806 pages as utilizing semantic cloaking. 400 random pages are selected from the 364,993 pages for manual evaluation.  Accuracy 96.8%  Precision 91.5%  Recall 82.7%

Semantic cloaking pages in DMOZ 46,806 * / = 51, M pages in total So, more than 1% of all pages within ODP are expected to utilize semantic cloaking

Semantic cloaking pages in ODP A. Arts E. Home I. Health M. Shopping B. Games F. Society J. Science N. Reference C. Recreation G. Kids&Teens K. Regional O. Business D. Sports H. Computers L. World P. News

Outline Motivation Proposed Solution Evaluation Conclusion

Discussion & Conclusion An automated system to detect semantic cloaking is possible! What if the spammers read this paper?  Need to be less ambitious to bypass the filtering step  Difficult to avoid all the features used in the classification step Future work  Better heuristic rules for the filtering step  More features to improve recall  IP-based semantic cloaking

Thank You! Baoning Wu