22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop LEHIGH UNIVERSITY

Presentation transcript:

Introduction: Web Search
- Web search: the access point to the Web for hundreds of millions of people
- Hundreds of millions of queries per day
- Queries + people = TRAFFIC
- A HUGE incentive for web site owners to rank highly in search engine results
  - Communicate some message (advertising, political statement)
  - Install viruses, adware, etc.
- Google, Yahoo!, MSN Search, Ask, A9, Exalead, Gigablast, + metasearch, + many more!

Introduction: Web Spam
- a.k.a. search engine spam, spamdexing
- Any technique to manipulate search engine results
  - Target page gets an undeservedly higher ranking
- Many methods
  - Link farms, keyword stuffing, cloaking, link bombs, and more
- The target of much of our work!

Propagating Trust and Distrust to Demote Web Spam Baoning Wu, Vinay Goel, and Brian D. Davison Computer Science & Engineering Lehigh University Bethlehem, PA USA

Outline
- Background and motivation
- Proposed methods
- Experimental results

Background: PageRank
- (Page and Brin, 1998)
- Uses number and status of “parents” to determine status of child
- $r^{(i+1)} = (1-\alpha) \cdot T \cdot r^{(i)} + \alpha \cdot s$
  - $r$: PageRank score vector (with $N$ nodes)
  - $T$: transition matrix ($N \times N$)
  - $(1-\alpha)$: decay factor; $\alpha$: jump probability
  - $s$: uniform distribution of $1/N$
- PageRank score generates a ranking of importance of nodes
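
The PageRank iteration on this slide can be sketched as a short power iteration. This is a minimal illustration, not the authors' code: the function name `pagerank`, the adjacency-dict input, the fixed iteration count, and the uniform handling of nodes without outgoing links are all assumptions made for the example.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Iterate r = (1 - alpha) * T * r + alpha * s with uniform s = 1/N.

    out_links maps every node to the list of nodes it links to
    (assumption: every linked node also appears as a key).
    """
    nodes = list(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Jump term: alpha * s, with s uniform.
        nxt = {v: alpha / n for v in nodes}
        for parent, children in out_links.items():
            if children:
                # Each parent splits its decayed score equally among children.
                share = (1 - alpha) * r[parent] / len(children)
                for child in children:
                    nxt[child] += share
            else:
                # Assumption: a node with no out-links spreads its score uniformly.
                for v in nodes:
                    nxt[v] += (1 - alpha) * r[parent] / n
        r = nxt
    return r


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

In this toy graph, `c` has two parents and ends up with the highest score, which matches the "number and status of parents" intuition.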

Background: TrustRank
- (Gyongyi and Garcia-Molina, VLDB 2004)
- Uses number and trust of “parents” to determine trust status of child
- $t^{(i+1)} = (1-\alpha) \cdot T \cdot t^{(i)} + \alpha \cdot s$
  - $t$: TrustRank score vector (with $N$ nodes)
  - $T$: transition matrix ($N \times N$)
  - $(1-\alpha)$: decay factor
  - $s$: seed set trust score distribution
    - Vector of size $N$, but only seed nodes are non-zero
- Demotes web spam by propagating trust from a known good seed set.
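
The only change from PageRank is the jump vector `s`, which is non-zero on seed nodes only. The sketch below illustrates that difference under stated assumptions: the name `trustrank`, the equal weighting of seeds, and letting trust leak at nodes without out-links are choices made for the example, not details taken from the slides.

```python
def trustrank(out_links, seeds, alpha=0.15, iterations=100):
    """Iterate t = (1 - alpha) * T * t + alpha * s, with s non-zero only on seeds.

    Assumptions: seeds is a non-empty set of trusted nodes, each seed gets
    equal weight in s, and every linked node appears as a key of out_links.
    """
    nodes = list(out_links)
    s = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    t = dict(s)
    for _ in range(iterations):
        # Jump term goes back to the trusted seeds only.
        nxt = {v: alpha * s[v] for v in nodes}
        for parent, children in out_links.items():
            if children:
                # Equal splitting: the parent divides its trust among children.
                share = (1 - alpha) * t[parent] / len(children)
                for child in children:
                    nxt[child] += share
            # Assumption: trust held by a node without out-links is dropped.
        t = nxt
    return t


graph = {
    "good": ["a"],
    "a": ["b"],
    "b": [],
    "spam": ["spam2"],
    "spam2": ["spam"],
}
trust = trustrank(graph, seeds={"good"})
```

Because no trusted page links into the `spam` cluster, its nodes never receive any trust, while trust decays along the chain `good`, `a`, `b`.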

Specific Motivation
- In TrustRank, a parent divides its trust among its children. This may not be optimal: real-world trust relationships are independent of the number of trusted entities.
- Distrust can also be propagated.
[Figure: nodes A and B connected by a hyperlink, with arrows labeled Trust Propagation and Distrust Propagation]

Key steps in propagation
- Decay of trust (d)
  - Trust is not perfectly transitive.
- Splitting of trust
  - For each parent, how to divide its score among its children.
- Accumulation of trust
  - For each child, how to accumulate the overall score given the portions from all of its parents.

Outline
- Background and motivation
- Proposed methods
- Experimental results

Choices for Trust Splitting
- Given a node i with trust score TR(i) and O(i) outgoing links:
  - Equal splitting
    - Gives d*TR(i)/O(i) to each child (used by TrustRank)
  - Constant splitting
    - Gives d*TR(i) to each child
  - Logarithmic splitting
    - Gives d*TR(i)/log(1+O(i)) to each child
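
The three splitting rules above can be written as one small function. This is a sketch for illustration: the name `split_share` and the method strings are hypothetical, and the slide does not state the base of the logarithm, so the natural logarithm is assumed here.

```python
import math


def split_share(trust, out_degree, d, method):
    """Return the portion of a parent's trust that each child receives.

    trust is TR(i), out_degree is O(i), d is the decay factor.
    """
    if out_degree == 0:
        return 0.0  # no children to pass trust to
    if method == "equal":
        return d * trust / out_degree
    if method == "constant":
        return d * trust
    if method == "logarithmic":
        # Assumption: natural log; the slide does not give the base.
        return d * trust / math.log(1 + out_degree)
    raise ValueError(f"unknown splitting method: {method}")


# A parent with trust 1.0 and 4 out-links, decay d = 0.5.
equal_share = split_share(1.0, 4, 0.5, "equal")        # 0.125
constant_share = split_share(1.0, 4, 0.5, "constant")  # 0.5
log_share = split_share(1.0, 4, 0.5, "logarithmic")    # 0.5 / ln(5)
```

For the same parent, the per-child share grows from equal to logarithmic to constant splitting, since the divisor shrinks from O(i) to log(1+O(i)) to 1.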

Choices for Trust Accumulation
- Simple summation
  - Sum the trust values from each parent
- Maximum share
  - Use the maximum of the trust values sent by the parents
- Maximum parent
  - Sum the trust values but never exceed the trust score of the most-trusted parent
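
A minimal sketch of the three accumulation rules, assuming each child is given the list of shares sent by its parents and, for the maximum-parent rule, the parents' own trust scores. The function name `accumulate` and the method strings are hypothetical.

```python
def accumulate(shares, parent_scores, method):
    """Combine the trust portions a child receives from its parents.

    shares: trust values sent by each parent.
    parent_scores: the parents' own trust scores (used by "maximum_parent").
    """
    if not shares:
        return 0.0
    if method == "simple_summation":
        return sum(shares)
    if method == "maximum_share":
        return max(shares)
    if method == "maximum_parent":
        # Sum the shares, capped at the most-trusted parent's score.
        return min(sum(shares), max(parent_scores))
    raise ValueError(f"unknown accumulation method: {method}")


shares = [0.2, 0.3, 0.4]
parents = [0.5, 0.6, 0.7]
total = accumulate(shares, parents, "simple_summation")  # about 0.9
best = accumulate(shares, parents, "maximum_share")      # 0.4
capped = accumulate(shares, parents, "maximum_parent")   # 0.7
```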

Propagating Distrust
- Distrust can be propagated from a seed set of bad nodes.
- Similar to trust propagation, but in reverse: follow incoming links, not outgoing links
- Same key choices for decay, splitting and accumulation
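
Following incoming links means a page inherits distrust from the pages it points to. The sketch below shows one possible combination of the choices listed on the slides (constant splitting with maximum-share accumulation); it is an illustration only, not necessarily the configuration the authors ran, and the names `reverse_graph` and `propagate_distrust`, the seed score of 1.0, and the step count are assumptions.

```python
def reverse_graph(out_links):
    """Build child -> list of parents from parent -> list of children."""
    in_links = {v: [] for v in out_links}
    for parent, children in out_links.items():
        for child in children:
            in_links.setdefault(child, []).append(parent)
    return in_links


def propagate_distrust(out_links, bad_seeds, d=0.5, steps=3):
    """Push distrust from bad seeds backwards along hyperlinks.

    Assumptions: seeds start at 1.0, each step passes d * score to every
    parent (constant splitting), and a node keeps the largest value it
    has received (maximum share).
    """
    in_links = reverse_graph(out_links)
    distrust = {v: (1.0 if v in bad_seeds else 0.0) for v in in_links}
    for _ in range(steps):
        nxt = dict(distrust)
        for node, parents in in_links.items():
            for parent in parents:
                nxt[parent] = max(nxt[parent], d * distrust[node])
        distrust = nxt
    return distrust


graph = {"x": ["y"], "y": ["spam"], "spam": [], "z": []}
distrust = propagate_distrust(graph, bad_seeds={"spam"})
```

Here `y` links directly to the bad seed and receives 0.5, `x` is two hops away and receives 0.25, and the unconnected `z` stays at 0.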

Combining Trust and Distrust
- For each node i with trust score TR(i) and distrust score DIS_TR(i), the combined score Total(i) can be computed as
  $\mathrm{Total}(i) = \eta \cdot \mathrm{TR}(i) - \beta \cdot \mathrm{DIS\_TR}(i)$
  where $0 \le \eta \le 1$ and $0 \le \beta \le 1$
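
The combination formula is a per-node weighted difference. A minimal sketch, assuming scores are stored in dictionaries and that a node missing from either dictionary has score 0 (the function name `combine` is hypothetical):

```python
def combine(trust, distrust, eta, beta):
    """Total(i) = eta * TR(i) - beta * DIS_TR(i) for every node i."""
    if not (0 <= eta <= 1 and 0 <= beta <= 1):
        raise ValueError("eta and beta must lie in [0, 1]")
    nodes = set(trust) | set(distrust)
    return {
        v: eta * trust.get(v, 0.0) - beta * distrust.get(v, 0.0)
        for v in nodes
    }


trust_scores = {"a": 0.8, "b": 0.1}
distrust_scores = {"b": 0.6}
total = combine(trust_scores, distrust_scores, eta=0.5, beta=0.5)
```

With equal weights, `a` keeps a positive score of 0.4 while `b`, which has little trust and substantial distrust, drops to -0.25 and would be ranked below it.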

Outline
- Background and motivation
- Proposed methods
- Experimental results

Data set
- 20M pages from the Swiss search engine [search.ch] in 2004
- 350K sites with “.ch” domain
  - We used only this site graph
- Seed sets
  - 3,589 sites labeled as using web spam of various kinds (labels provided)
  - 20,005 sites with pages in dir.search.ch topics, used as the trusted set

Experimental Design
- Explore various combinations of trust and distrust propagation
- Evaluation
  - The performance of TrustRank is measured as the number of spam sites found among the highest-ranked ~1% of sites (fewer is better).
  - We use the same metric in this work.
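
The evaluation metric counts known spam sites in the top slice of the ranking. A minimal sketch, assuming scores are held in a dictionary and that the cut-off is exactly 1% of the sites, rounded down (the slide says "~1%", so the exact cut-off and the name `spam_in_top` are assumptions):

```python
def spam_in_top(scores, spam_sites, fraction=0.01):
    """Count labeled spam sites among the highest-scored fraction of sites."""
    k = max(1, int(len(scores) * fraction))
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for site in top if site in spam_sites)


# Toy ranking: 200 sites, site0 scored highest, site199 lowest.
scores = {f"site{i}": 200 - i for i in range(200)}
spam = {"site1", "site150"}
found = spam_in_top(scores, spam)
```

With 200 sites, the top 1% is two sites (`site0` and `site1`), so only one of the two labeled spam sites is counted.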

Baseline result

Algorithm                                 Num. spam sites
PageRank                                  90
TrustRank                                 58
Topical TrustRank (Wu et al., WWW2006)    33-42

Simple TrustRank Improvement: Increase jump probability (α)
[Figure: plot with axis α; default α = 0.15]

Other trust propagation methods
[Table: rows Simple Summation, Maximum Share, Maximum Parent; columns Constant Splitting and Logarithmic Splitting, with a "Decay =" sub-header. The decay values and cell values are missing from the transcript, except for a single entry, 364, following the Simple Summation label.]

Results of propagating distrust
Combined equally with TrustRank, 200 seeds
[Table: rows Simple Summation, Maximum Share, Maximum Parent; columns Constant Splitting and Logarithmic Splitting, with a "d Distrust =" sub-header. The cell values are missing from the transcript.]

Combining trust and distrust
Using best scoring trust and distrust formulations, β = (1 - η)
[Figure: axis running between (Distrust Only) and (Trust Only); one value is marked >2200]

Coverage of trust propagation
Percentage of sites affected by approach. TrustRank reached 76.05%.
[Table: rows Maximum Share, Maximum Parent; columns Constant Splitting and Logarithmic Splitting, with a "Decay" sub-header. The cell values are missing from the transcript.]

Conclusions
- Propagating trust based on outdegree does not appear to be optimal.
- Alternative splitting and accumulation methods can help to demote top ranked spam sites.
- Propagating distrust can also help to demote top ranked spam sites.
- Additional tests needed!
  - E.g., to examine impact on retrieval

Thank You! Questions?
Contact Info:
Dr. Brian D. Davison
davison(at)cse.lehigh.edu
WUME Laboratory
Computer Science and Engineering
Lehigh University
Bethlehem, PA USA