WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.

Presentation transcript:

WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University

Contents
- Model of the Web
- Definition of Web Spam
- History of Web Spam
- Types of Web Spam
- Countermeasures
- Conclusion

The World Wide Web
- Huge
- Distributed content creation and linking (no coordination)
- Structured databases, unstructured text, semi-structured data
- Content includes truth, lies, obsolete information, contradictions, …

Search Engines As Gateways
- Search has become the default gateway to the web
- There is a very high premium on appearing on the first page of search results
  - e.g., e-commerce sites
  - advertising-driven sites
- This gives ranking important economic consequences

Definition Of Web Spam
- Web spam can be defined as any intentional human activity that gives a web page an unreasonably favorable ranking or importance, one that the page would not naturally deserve. [1]
- This is also called spamming or spamdexing.

History Of Web Spam
1st Generation Search Engine Companies
- Web spam was introduced by these companies in the 1990s
- The technique came to be known as 'Glittering Generalities'
2nd Generation Search Engine Companies
- Neutralized Glittering Generalities
- Ranked pages according to their popularity
- Popularity was determined by the links pointing to the web page
- Spammers built link farms to circumvent it
3rd Generation Search Engine Companies
- Use PageRank and the HITS algorithm to rank pages
- Spammers have found new ways as well!

Boosting Techniques
These are the spamming techniques that influence the ranking algorithm of the search engine. They can be classified into two main categories:
- Term Spamming: Manipulating the text of web pages in order to appear relevant to queries
- Link Spamming: Creating link structures that boost PageRank or hubs-and-authorities scores

Taxonomy For Boosting Techniques

Types Of Term Spamming
- Body Spam: The spam terms are present in the body of the page. This is the simplest and most common technique in term spamming.
- Title Spam: The spam terms are present in the title tag of the web page.
- Meta Tag Spam: The spam terms appear in the meta tags of the web page.
  e.g. <meta name="Flowers" content="buy, cheap, roses, lilly, daffodils, flower vase, pink rose">

Types Of Term Spamming
- Anchor Text Spam: The spam terms appear in the anchor text of links found on the web page.
  - The terms in anchor text are given more importance
  - The words are indexed for both the target page and the source page
  e.g. Flowers, cheap deals, rose, daffodils, flower vase

Types Of Term Spamming
- URL Spam: Spam terms appear in the URL of web pages.
  - Search engines sometimes parse the URL and use its terms to decide whether the page is relevant.
- DNS Spam: Spammers set up a DNS server that resolves any hostname to a single domain.
- Repetition: A term is repeated many times in a field of the web page to make the page rank well for a specific query.

Types Of Term Spamming
- Dumping: A large number of unrelated terms are put together in the fields of the web page.
  - Helps the page match a wide variety of queries
- Weaving: Content is duplicated from another web page and spam terms are inserted in between the copied content.
- Phrase Stitching: Sentences from different sources are concatenated and placed in the fields of the web page.
  e.g. His article is about forests as communities of trees. Naco is the world leader in Rain forest protection
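
The repetition technique described above can be illustrated with a simple frequency check. The following is only a minimal sketch, not a method from the slides: the function names and the 0.3 threshold are assumptions chosen for illustration, and a real detector would combine many more signals.

```python
import re
from collections import Counter


def term_stats(text):
    """Return (most frequent term, its share of all words) for a text field."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return None, 0.0
    counts = Counter(words)
    top_term, top_count = counts.most_common(1)[0]
    return top_term, top_count / len(words)


def looks_repetitive(text, max_fraction=0.3):
    """Flag a field when one term dominates it.

    max_fraction is a hypothetical threshold, not a value from the slides.
    """
    _, fraction = term_stats(text)
    return fraction > max_fraction


spammy = "cheap roses cheap roses cheap cheap cheap deals"
normal = "His article is about forests as communities of trees"
print(term_stats(spammy), looks_repetitive(spammy))
print(term_stats(normal), looks_repetitive(normal))
```

Very short fields will trip such a check trivially, so in practice it would likely be applied only to fields above some minimum length.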

Types Of Link Spamming
- Outdegree: Spammers create web pages that have a high number of links pointing to well-known pages.
  - Can be done easily by directory cloning
- Indegree: Spammers create pages that have useful content but also hidden links to spam pages.
  - These pages are called honey pots
  - Can be achieved by adding links in directory structures
  - Link farms
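
To see why link farms work against link-based popularity, the sketch below runs a plain power-iteration PageRank on a toy graph with and without a farm. The graph, the page names, and the damping factor of 0.85 are all assumptions made for illustration; the slides do not specify any of them.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "a" and "b" link to each other; nobody links to "target".
honest = {"a": ["b"], "b": ["a"], "target": ["a"]}

# The same web plus a five-page link farm that points only at "target".
farm = dict(honest)
for i in range(5):
    farm[f"farm{i}"] = ["target"]

print(pagerank(honest)["target"])
print(pagerank(farm)["target"])
```

In this toy setting the farm raises the target's score even though the farm pages themselves have no inlinks, because every page receives a small baseline rank that it passes on to the pages it links to.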

Hiding Techniques
Techniques to hide spam content on a web page:
- Content Hiding
- Cloaking
- Redirection

Types Of Hiding Techniques
- Content Hiding: Spam content on the page is hidden by using
  - Color schemes (e.g. spam text in the same color as the background)
  - Images in place of anchor text

Types Of Hiding Techniques
- Cloaking: Serve one version of a page to crawlers and a different version to users.
  - Pages check the IP address of crawlers
  - Pages check the User-Agent field in the HTTP request
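
One way to look for cloaking is to obtain two copies of the same URL, one requested as a crawler and one as a browser, and compare them. The sketch below covers only the comparison step and assumes both copies have already been fetched as text; the shingle size, the 0.5 threshold, and the function names are illustrative assumptions rather than anything given in the slides.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(crawler_copy, browser_copy, threshold=0.5):
    """Flag a page whose crawler and browser copies share few shingles.

    threshold is a hypothetical cut-off, not a value from the slides.
    """
    similarity = jaccard(shingles(crawler_copy), shingles(browser_copy))
    return similarity < threshold


crawler_copy = "buy cheap roses daffodils flower vase pink rose cheap deals"
browser_copy = "welcome to our casino play poker online and win big today"
print(looks_cloaked(crawler_copy, browser_copy))
print(looks_cloaked(crawler_copy, crawler_copy))
```

Legitimate dynamic pages (rotating ads, personalization) also differ between requests, so a low similarity score is a hint to investigate rather than proof of cloaking.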

Cloaking Example
HTTP request to the page:
GET / HTTP/1.1[CRLF]
Host: yahoo.com[CRLF]
Connection: close[CRLF]
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: ) Gecko/ Firefox/ Web-Sniffer/1.0.24[CRLF]
Referer:

Cloaking Example

Types Of Hiding Techniques
- Redirection: Web pages redirect to spam pages when opened.
  - Search engines index the normal page
  - The user is redirected to the spam page on opening it

Trust Rank
- Basic principle: approximate isolation
  - It is rare for a “good” page to point to a “bad” (spam) page
- Sample a set of “seed pages” from the web
- Set the trust of each trusted seed page to 1
- Propagate trust through links
- Each page gets a trust value between 0 and 1
- Use a threshold value and mark all pages below the trust threshold as spam
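
The steps above can be sketched as a PageRank-style iteration whose random jumps return only to the trusted seeds. This is a minimal sketch under stated assumptions, not the exact published algorithm: the seed trust is normalized to sum to 1 (instead of setting each seed to 1) so that every score stays between 0 and 1, and the toy graph, damping factor, and threshold are hypothetical.

```python
def trust_rank(graph, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | set(seeds)
    for targets in graph.values():
        pages.update(targets)
    # Assumption: seed trust is normalized so all scores stay within [0, 1].
    seed_score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_score)
    for _ in range(iterations):
        new_trust = {p: (1.0 - damping) * seed_score[p] for p in pages}
        for page, targets in graph.items():
            if targets:
                # A page splits its trust evenly among the pages it links to.
                share = damping * trust[page] / len(targets)
                for t in targets:
                    new_trust[t] += share
        trust = new_trust
    return trust


# Hypothetical toy web: good pages never link to spam pages.
web = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(web, seeds={"good1"})

threshold = 0.01  # hypothetical cut-off
suspected_spam = {p for p, score in trust.items() if score < threshold}
print(sorted(suspected_spam))
```

Because no trusted page links to `spam1` or `spam2`, no trust ever reaches them, which is the approximate isolation principle at work.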

Anti-Trust Approach
- Broadly based on the same “approximate isolation” principle
- This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves
- Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages
- A page can be classified as spam if its Anti-Trust Rank value exceeds a chosen threshold value
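
The reverse propagation described above can be sketched by running the same kind of iteration on the transposed link graph, starting from known spam pages. As before, this is only an illustrative sketch: the normalization of the seed scores, the damping factor, the threshold, and the toy graph are assumptions, not details taken from the slides.

```python
def anti_trust_rank(graph, spam_seeds, damping=0.85, iterations=50):
    """Propagate distrust from spam seeds backwards along incoming links.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | set(spam_seeds)
    # Transpose the graph: an original link u -> v becomes v -> u, so that
    # distrust flows from a spam page to the pages that link to it.
    reversed_graph = {}
    for page, targets in graph.items():
        for t in targets:
            pages.add(t)
            reversed_graph.setdefault(t, []).append(page)
    # Assumption: seed distrust is normalized to sum to 1.
    seed_score = {
        p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages
    }
    distrust = dict(seed_score)
    for _ in range(iterations):
        new_distrust = {p: (1.0 - damping) * seed_score[p] for p in pages}
        for page, sources in reversed_graph.items():
            share = damping * distrust[page] / len(sources)
            for s in sources:
                new_distrust[s] += share
        distrust = new_distrust
    return distrust


# Hypothetical toy web: spam pages link to each other and to one good page.
web_graph = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
distrust = anti_trust_rank(web_graph, spam_seeds={"spam2"})

spam_threshold = 0.01  # hypothetical cut-off
flagged = {p for p, score in distrust.items() if score > spam_threshold}
print(sorted(flagged))
```

Note that `good1` is linked from `spam1` but never links to a spam page, so it receives no distrust; only pages that point toward the spam seed are flagged.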

Conclusion
- Web spam is a by-product of the search engine era
- Identifying the structure of web spam is the first step to fighting it
- Due to the inherent characteristics of the Web, it is difficult to eliminate web spam altogether
- Different web spam detection techniques can be combined to detect spam more effectively

Thank you

References
[1] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb),