Using Propagation of Distrust to find Untrustworthy Web Neighborhoods Panagiotis Takis Metaxas Computer Science Department Wellesley College, USA ICIW2009.

Slides:



Advertisements
Similar presentations
Lecture 18: Link analysis
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
What is WEB SPAM Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”
Trust and Propaganda in Cyberspace Panagiotis Takis Metaxas Computer Science Department Wellesley College.
22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop L EHIGH U NIVERSITY.
How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Yahoo! Research Bradley Horowitz VP Product Strategy Yahoo, Inc. August 2006.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Internet Resources Discovery (IRD) Search Engines Quality.
The PageRank Citation Ranking “Bringing Order to the Web”
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Link Structure and Web Mining Shuying Wang
Network Structure and Web Search Networked Life CIS 112 Spring 2010 Prof. Michael Kearns.
Information Retrieval
Link Analysis HITS Algorithm PageRank Algorithm.
Overview of Web Data Mining and Applications Part I
WEB SCIENCE: SEARCHING THE WEB. Basic Terms Search engine Software that finds information on the Internet or World Wide Web Web crawler An automated program.
Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings Baoning Wu and Brian D. Davison Lehigh University Symposium on Applied.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
The effects of Web Spam on The Evolution of Search Engines CS315-Web Search and Mining.
1 SOCIAL BOOKMARKING 101. HIBA KHALID BILAL SAEED KHAN FARID ALIANI ASKARI HASAN SOCIAL BOOKMARKING.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
The effects of Web Spam on The Evolution of Search Engines CS315-Web Search and Mining.
Adversarial Information Retrieval The Manipulation of Web Content.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
Adversarial Information Retrieval on the Web or How I spammed Google and lost Dr. Frank McCown Search Engine Development – COMP 475 Mar. 24, 2009.
Data Structures & Algorithms and The Internet: A different way of thinking.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Influence of Search Engines Christina Pong cs349.
Coverage and Independence: Measuring Quality in Web Search Results Panagiotis Takis Metaxas Lilia Ivanova Eni Mustafaraj Department of Computer Science.
Search Engine Marketing SEM = Search Engine Marketing SEO = Search Engine Optimization optimizing (altering/changing) your page in order to get a higher.
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
CS315-Web Search & Data Mining. A Semester in 50 minutes or less The Web History Key technologies and developments Its future Information Retrieval (IR)
CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.
Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada.
BING!-Microsoft's new search engine Launched May 28, 2009 Appealing interface A “decision engine” not just a search engine *Shopping, health, travel, local.
Ranking CSCI 572: Information Retrieval and Search Engines Summer 2010.
Ranking Link-based Ranking (2° generation) Reading 21.
COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.
Detecting Web Spam through Backward Propagation of Distrust CS315-Web Search and Mining.
1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.
The effects of Web Spam on The Evolution of Search Engines CS315-Web Search and Mining.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
What is WEB SPAM Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.
WEB SPAM.
Quality of a search engine
A Comparative Study of Link Analysis Algorithms
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS 440 Database Management Systems
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Data Mining Chapter 6 Search Engines
Web Search Engines.
Graph and Link Mining.
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
Lecture 22 SVD, Eigenvector, and Web Search
Presentation transcript:

Using Propagation of Distrust to find Untrustworthy Web Neighborhoods Panagiotis Takis Metaxas Computer Science Department Wellesley College, USA ICIW2009 – Venice, Italy (May)

Outline of the Talk The role of Search Engines in Web experience What is Web Spam Why Search Engines evolve? Web Graph vs. Societal Trust Evolution of Search Engines: Backward Propagation of Distrust

Have you used the Web… to get informed? to help you make decisions? Financial Medical Political Religious… The Web is huge > 1 trillion (! ?) static pages publicly available, … and growing every day Much larger, if you count the “deep web” Infinite, if you count pages created on-the-fly We depend on search engines to find information

Web information can be unreliable Anyone can be an author on the web!

You know Spam…

The Web has Spam too! Search results steroid drug HGH (human growth hormone) Search results steroid drug HGH (human growth hormone)

Any controversial issue will be spammed Search results for mental disease ADHD (attention-deficit/hyperactivity disorder) Search results for mental disease ADHD (attention-deficit/hyperactivity disorder)

Political issues will be spammed Search results for Senatorial candidate John N. Kennedy, 2008 USA Elections

… you like it or not! But Google is usually so good in finding info… Why does it do that? Famous search results for “miserable failure”

Why? Web Spam: Attempt to modify the web (its structure and contents), and thus influence search engine results in ways beneficial to web spammers

crawl the web create inverted index Inverted index Search engine servers Document IDs How Google (and the other search engines) Work user query THE WEB Rank results

A Brief History of Search Engines 1st Generation (ca 1994): AltaVista, Excite, Infoseek… Ranking based on Content:  Pure Information Retrieval 2nd Generation (ca 1996): Lycos Ranking based on Content + Structure  Site Popularity 3rd Generation (ca 1998): Google, Teoma, Yahoo Ranking based on Content + Structure + Value  Page Reputation In the Works Ranking based on “the user’s need behind the query”

1st Generation: Content Similarity Content Similarity Ranking: The more rare words two documents share, the more similar they are Documents are treated as “bags of words” (no effort to “understand” the contents) Similarity is measured by vector angles Query Results are ranked by sorting the angles between query and documents How To Spam? t 1 d2d2 d 1 t 3 t 2 θ

1st Generation: How to Spam “Keyword stuffing”: Add keywords, text, to increase content similarity Page stuffed with casino- related keywords

2nd Generation: Add Popularity A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B Rank similar documents according to popularity How To Spam?

2nd Generation: How to Spam Create “Link Farms”: Heavily interconnected owned sites spam popularity Interconnected sites owned by vespro.com promote main site

3rd Generation: Add Reputation… The reputation “PageRank” of a page P i = the sum of a fraction of the reputations of all pages P j that point to P i Idea similar to academic co-citations Beautiful Math behind it PR = principal eigenvector of the web’s link matrix PR equivalent to the chance of randomly surfing to the page HITS algorithm tries to recognize “authorities” and “hubs” How To Spam?

3rd Generation: How to Spam Organize Mutual Admiration Societies: “link farms” of irrelevant reputable sites

Mutual Admiration Societies via Link Exchange

An Industry is Born “Search Engine Optimization” Companies Advertisement Consultants Conferences

3rd Generation: Reputation & Anchor Text Anchor text tells you what the reputation is about How To Spam? Page A Page B Anchor Armonk, NY-based computer giant IBM announced today Joe’s computer hardware links Compaq HP IBM Big Blue today announced record profits for the quarter

“Google-bombs” spam Anchor Text… Business weapons “more evil than satan” Political weapon in pre-election season “miserable failure” “waffles” “Clay Shaw” (+ 50 Republicans) Misinformation Promote steroids Discredit AD/HD research Activism / online protest “Egypt” “Jew” Other uses we do not know? “views expressed by the sites in your results are not in any way endorsed by Google…”

… mostly for political purposes Activists openly collaborating to Google-bomb search results of political opponents in 2006 “miserable failure hits Obama in January 2009

Search Engines vs Web Spam Search Engine’s Action 1st Generation: Similarity Content 2nd Generation: + Popularity Content + Structure 3rd Generation: + Reputation + Anchor Text Content + Structure + Value 4th Generation (in the Works) Ranking based on the user’s “need behind the query” Web Spammers Reaction Add keywords so as to increase content similarity + Create “link farms” of heavily interconnected sites + Organize “mutual admiration societies” of irrelevant reputable sites+ Googlebombs ?? Is there a pattern on how to spam? Can you guess what they will do?

And Now For Something Completely(?) Different Propaganda: Attempt to modify human behavior, and thus influence people’s actions in ways beneficial to propagandists Theory of Propaganda Developed by the Institute for Propaganda Analysis Propagandistic Techniques (and ways of detecting propaganda) Word games - associate good/bad concept with social entity  Glittering Generalities — Name Calling Transfer - use special privileges (e.g., office) to breach trust Testimonial - famous non-experts’ claims Plain Folk - people like us think this way Bandwagon - everybody’s doing it, jump on the wagon Card Stacking - use of bad logic

Societal Trust is (also) a Graph Propaganda: Attempt to modify the Societal Trust Graph and thus influence people in ways beneficial to propagandist Web Spam: Attempt to modify the Web Graph, and thus influence users through search engine results in ways beneficial to web spammers

Web Spammers as Propagandists Web Spammers can be seen as employing propagandistic techniques in order to modify the Web Graph There is a pattern on how to spam!

Propaganda in Graph Terms Word Games Name Calling Glittering Generalities Transfer Testimonial Plain Folk Card stacking Bandwagon Modify Node weights Decrease node weight Increase node weight Modify Node content + keep weights Insert Arcs b/w irrelevant nodes Modify Arcs Mislabel Arcs Modify Arcs & generate nodes

Anti-Propagandistic Lessons for Web How do you deal with propaganda in real life? Backwards propagation of distrust The recommender of an untrustworthy message becomes untrustworthy Can you transfer this technique to the web?

An Anti-Propagandistic Algorithm Start from untrustworthy site s S = {s} Using BFS for depth D do: Find the set U of sites linking to sites in S (using the Google API for up to B b-links/site) Ignore blogs, directories, edu’s S = S + U Find the bi-connected component BCC of U that includes s BCC shows multiple paths to boost the reputation of s

Backwards Propagation of Distrust Start from untrustworthy site s S = {s} Using BFS for depth D do: Find the set U of sites linking to sites in S (using the Google API for up to B b-links/site) Ignore blogs, directories, edu’s S = S + U Find the bi-connected component BCC of U that includes s BCC shows multiple paths to boost the reputation of s

BCC vs Periphery Since the BCC reveals multiple paths to boost the reputation of s, we expect it to contain a higher percentage of untrustworthy sites The Periphery of the BCC, on the other hand, should have significantly lower percentage of untrustworthy sites BCC Periphery

Explored neighborhoods

Evaluated Experimental Results The trustworthiness of starting site is a very good predictor for the trustworthiness of BCC sites The BCC is significantly more predictive of untrustworthiness than the Periphery BCC Periphery

Living in Cyberspace Critical Thinking, Education Realize how do we know what we know “Of course it’s true; I saw it on the Internet!” Cyber-social Structures that mimic Societal ones Know why to trust or distrust Who do you trust on a particular subject? A Search Engine per Browser Easier to fool one search engine than to fool millions of readers Enable the reader to keep track of her trust network Tools of cyber trust How would you avoid the Emulex hoax?

Thank You!

How (not) To Solve The Problem

Link Farms vs MAS