WEB SPAM.

Slides:



Advertisements
Similar presentations
TrustRank Algorithm Srđan Luković 2010/3482
Advertisements

What is WEB SPAM Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”
Analysis of Large Graphs: TrustRank and WebSpam
Link Analysis: PageRank and Similar Ideas. Recap: PageRank Rank nodes using link structure PageRank: – Link voting: P with importance x has n out-links,
22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop L EHIGH U NIVERSITY.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
The PageRank Citation Ranking “Bringing Order to the Web”
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
CS345 Data Mining Link Analysis 3: Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
CM143 - Web Week 2 Basic HTML. Links and Image Tags.
1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4.
Search Engine Optimization By Andy Smith | Art Institute of Dallas.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Information Retrieval
WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.
Google and the Page Rank Algorithm Székely Endre
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
WageIndicator SEO, December 10, 2008 Irene van Beveren Today: 0.Why SEO is important 1.Keyword Strategies 2.Title Tags 3.Internal Links 4.Duplicate Content.
Todd Friesen April, 2007 SEO Workshop Web 2.0 Expo San Francisco.
Increasing Website ROI through SEO and Analytics Dan Belhassen greatBIGnews.com Modern Earth Inc.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Search Engine Optimization
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Search Optimization Techniques Dan Belhassen greatBIGnews.com Modern Earth Inc.
Web Spam Detection: link-based and content-based techniques Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/11/8 1.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Search Engine Optimization. Introduction SEO is a technique used to optimize a web site for search engines like Google, Yahoo, etc. It improves the volume.
Adversarial Information Retrieval The Manipulation of Web Content.
Adversarial Information Retrieval on the Web or How I spammed Google and lost Dr. Frank McCown Search Engine Development – COMP 475 Mar. 24, 2009.
Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.
Web Spam Detection with Anti- Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University.
Promotion & Cataloguing AGCJ 407 Web Authoring in Agricultural Communications.
Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.
Web spamming Detecting Spam Web Pages through Content Analysis Alexandros Ntoulas et al, 2006, International World Wide Web Conference.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
Search Engine Marketing Gay, Charlesworth & Esen Chapter 6.
SEO techniques & Mastering Google Adwords By Ganesh.S
Lecture 2: Extensions of PageRank
Search Engine Optimization: A Survey of Current Best Practices Author - Niko Solihin Resource -Grand Valley State University April, 2013 Professor - Soe-Tsyr.
Link Analysis in Web Mining Hubs and Authorities Spam Detection.
What Is SEO? Search engine optimization (SEO) is the art and science of publishing and marketing information that ranks well for valuable keywords in.
Search Engines Reyhaneh Salkhi Outline What is a search engine? How do search engines work? Which search engines are most useful and efficient? How can.
SMX Madrid 2008 Uncovering the Algorithm A Peek Inside How Google Evaluates and Ranks Pages.
Search Engine and SEO Presented by Yanni Li. Various Components of Search Engine.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Chapter 1 Getting Listed. Objectives Understand how search engines work Use various strategies of getting listed in search engines Register with search.
Week 1 Introduction to Search Engine Optimization.
What is WEB SPAM Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”
Week 5  SEO  CSS Please Visit: to download all the PowerPoint Slides for.
Search Engine Optimization (SEO) Presentation By Celina Jonesi Small Business Seo – KG Tech.
Adversarial Information System Tanay Tandon Web Enhanced Information Management April 5th, 2011.
Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.
Jeffrey D. Ullman Stanford University. 3  Mutually recursive definition:  A hub links to many authorities;  An authority is linked to by many hubs.
Search Engine Optimization
Combatting Web Spam Dealing with Non-Main-Memory Web Graphs SimRank
Web Page Elements Writing For the Web
DATA MINING Introductory and Advanced Topics Part III – Web Mining
SEARCH ENGINE OPTIMIZATION SEO. What is SEO? It is the process of optimizing structure, design and content of your website in order to increase traffic.
1 SEO is short for search engine optimization. Search engine optimization is a methodology of strategies, techniques and tactics used to increase the amount.
A Comparative Study of Link Analysis Algorithms
CNIT 131 HTML5 – Anchor/Link.
Information Retrieval
Web Spam
Searching for Truth: Locating Information on the WWW
Searching for Truth: Locating Information on the WWW
Link Analysis II Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.
Information Retrieval and Web Design
Presentation transcript:

WEB SPAM

Economic considerations Search has become the default gateway to the web Very high premium to appear on the first page of search results e.g., e-commerce sites advertising-driven sites

What is web spam? Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, regardless of the page’s real value Spam = web pages that are the result of spamming Spammer Approximately 10-15% of web pages are spam

Web spam taxonomy Boosting techniques Techniques to achieve high relevance or importance for some pages Hiding techniques Techniques to hide the adopted boosting techniques from human web users and crawlers

Boosting techniques Term spamming Manipulating the text/fields of web pages in order to appear relevant to queries Link spamming Creating link structures that boost page rank

Cont….

Term spamming Target algorithms

Techniques Phrase Stitching Repetition of one or a few specific terms e.g., free, cheap increase relevance for a document with respect to a small number of query terms Dumping of a large number of unrelated terms e.g., copy entire dictionaries Weaving Copy legitimate pages and insert spam terms at random positions Phrase Stitching Glue together sentences and phrases from different sources

Term spam targets Title HTML meta tags Anchor text URL Body of web page Title HTML meta tags Anchor text URL

Link spam There are three kinds of web pages from a spammer’s point of view Inaccessible pages Accessible pages e.g., web log comments pages spammer can post links to his pages Own pages Completely controlled by spammer May span multiple domain names

Target algorithms Hub scores Hypertext Induced Topic Search (HITS) algorithm Hub scores A spammer should add many outgoing links to the target page t to increase its hub score Authority scores Having many incoming links from presumably important hubs

Cont…. PageRank Uses incoming link information to assign numerical weights to all pages on the web Numerical weight that it assigns to any given element E is also called the PageRank of E and denoted by PR(E) Spammers manipulate the algorithm using links

Targets and Techniques Outgoing links Directory cloning Incoming links Create a honey pot Infiltrate a web directory Post links to unmoderated message boards or guest books Participate in link exchange Create own spam farm

Link Farms Spammer’s goal Technique A link farm is a densely connected set of pages, created explicitly with the purpose of deceiving a link-based ranking algorithm (such as Google’s PageRank) Spammer’s goal Maximize the page rank of target page t Technique Get as many links from accessible pages as possible to target page t Construct “link farm” to get page rank multiplier effect

Link Farms One of the most common and effective organizations for a link farm

Hiding techniques Content hiding Cloaking Redirection Use same color for text and page background Spam terms or links on a page can be made invisible when the browser renders the page Cloaking Return different page to crawlers and browsers Redirection Alternative to cloaking Redirects are followed by browsers but not crawlers

Cont….

Detecting Web Spam Term spamming Link spamming Analyze text using statistical methods Similar to email spam filtering Also useful: detecting approximate duplicate pages Link spamming Open research area One approach: TrustRank

TrustRank idea Basic principle: approximate isolation It is rare for a “good” page to point to a “bad” (spam) page Sample a set of “seed pages” from the web Have an oracle (human) identify the good pages and the spam pages in the seed set Expensive task, so must make seed set as small as possible

Trust Propagation Call the subset of seed pages that are identified as “good” the “trusted pages” Set trust of each trusted page to 1 Propagate trust through links Each page gets a trust value between 0 and 1 Use a threshold value and mark all pages below the trust threshold as spam

Picking the seed set Human has to inspect each seed page, so seed set must be as small as possible Must ensure every “good page” gets adequate trust rank, so need make all good pages reachable from seed set by short paths

Approaches to picking seed set Suppose we want to pick a seed set of k pages PageRank Pick the top k pages by page rank Assume high page rank pages are close to other highly ranked pages We care more about high page rank “good” pages

Rules for trust propagation Trust attenuation The degree of trust conferred by a trusted page decreases with distance Trust splitting The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink Trust is “split” across outlinks

Fighting against web spam Identify instances of spam Prevent spamming Counterbalance

THANK YOU