Web spamming Detecting Spam Web Pages through Content Analysis Alexandros Ntoulas et al, 2006, International World Wide Web Conference.



Link stuffing: for link-based ranking, black hat SEO techniques include the creation of extraneous pages which link to a target page.
Keyword stuffing: the content of other pages may be “engineered” so as to appear relevant to popular searches.

Figure 1: An example spam page; although it contains popular keywords, the overall content is useless to a human user

Web spam The practices of crafting web pages for the sole purpose of increasing the ranking of these or some affiliated pages, without improving the utility to the viewer, are called “web spam”.

Why do web spamming?
–First, to make the search engine rank spam sites highly, drawing web searchers to those sites for economic gain.
–Second, to make the search engine expose spam sites so that users stop trusting its quality; in other words, an attack on the search engine.
–Finally, spam pages make a search engine waste storage space, time, and network resources.
–1/7 of English-language pages

Importance of detecting web spam
Creating an effective spam detection method is a challenging problem.
–Given the size of the web, such a method has to be automated.
–However, while detecting spam, we have to ensure that we identify spam pages alone, and that we do not mistakenly consider legitimate pages to be spam.
–At the same time, it is most useful if we can detect that a page is spam as early as possible, and certainly prior to query processing. In this way, we can allocate our crawling, processing, and indexing efforts to non-spam pages, thus making more efficient use of our resources.

Web spamming techniques

Web Spam Taxonomy By Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Workshop on Adversarial Information Retrieval on the Web, May 2005

Term Spamming
p: page, q: query words
TF(t) = the number of occurrences of term t in the document
IDF(t) = inversely related to the number of documents that contain term t
Term spamming attacks search engines whose ranking algorithm is based on the TFIDF score.
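The scoring formula itself did not survive in the transcript, so the following is only a minimal sketch of why repeating terms pays off under TFIDF-style ranking. It assumes the common form in which the score sums TF(t) · IDF(t) over query terms found on the page, with IDF(t) = log(N / n_t); the function name `tfidf_score`, the sample pages, and the document-frequency numbers are all hypothetical.

```python
import math
from collections import Counter


def tfidf_score(page_terms, query_terms, doc_freq, num_docs):
    """Sum TF(t) * IDF(t) over the query terms that occur on the page.

    IDF(t) = log(num_docs / doc_freq[t]) is an assumed, commonly used form.
    """
    tf = Counter(page_terms)
    score = 0.0
    for term in set(query_terms):
        if tf[term] == 0 or doc_freq.get(term, 0) == 0:
            continue  # term absent from the page or from the collection
        idf = math.log(num_docs / doc_freq[term])
        score += tf[term] * idf
    return score


# Hypothetical pages: an ordinary one and a keyword-stuffed one.
honest = "we review the canon rebel camera".split()
stuffed = "cheap camera cheap camera buy cheap camera".split()

# Hypothetical collection statistics: 1000 documents in total.
doc_freq = {"cheap": 100, "camera": 50}
query = ["cheap", "camera"]

honest_score = tfidf_score(honest, query, doc_freq, 1000)
stuffed_score = tfidf_score(stuffed, query, doc_freq, 1000)
print(honest_score, stuffed_score)
```

Because TF grows with every repetition, the stuffed page scores higher for the query even though it carries no more information.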

Term Spamming
Body/title/meta tag/Anchor text
Meta tag: <meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">
Anchor text: free, great deals, cheap, inexpensive, cheap, free
URL spam: buy-canon-rebel-20d-lens-case.camerasx.com, buy-nikon-d100-d70-lens-case.camerasx.com

How to do term spamming
–Repetition of one or a few specific terms
–Dumping of a large number of unrelated terms
–Weaving of spam terms into copied contents
–Phrase stitching is also used by spammers to create content quickly

Link Spamming
A technique that exploits the characteristics of the PageRank algorithm by manipulating outgoing links and incoming links.

Outgoing links
A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score. At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: one can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com).

Incoming links
–Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s).
–Post links on blogs, unmoderated message boards, guest books, or wikis. Spammers may include URLs to their spam pages as part of the seemingly innocent comments/messages they post.

Hiding Techniques-Content Hiding

Hiding Techniques-Cloaking
If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page), and, at the same time, send a spammed document to the search engine for indexing.

Hiding Techniques-Redirection

Spam occurrence per top-level domain
Data set: 105,484,446 web pages, collected by the MSN Search crawler during August 2004.

Spam occurrence per language in our data set.

Prevalence of spam - number of words on page

Prevalence of spam - number of words in title

Prevalence of spam - average word-length of page

Prevalence of spam - visible content on page

Prevalence of spam - compressibility of page
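The slides above survive only as figure titles, so here is a rough sketch of how such content features could be computed for a single page. The exact feature definitions are assumptions: words are taken to be `\w+` tokens, compressibility is taken to be uncompressed size divided by compressed size, and `zlib` stands in for whatever compressor the study used. The function name `content_features` is hypothetical, and the visible-content fraction is left out because it needs the raw HTML.

```python
import re
import zlib


def content_features(title, body_text):
    """Compute a few simple content features for one page (assumed definitions)."""
    words = re.findall(r"\w+", body_text)
    title_words = re.findall(r"\w+", title)
    raw = body_text.encode("utf-8")
    compressed = zlib.compress(raw)
    return {
        "num_words": len(words),
        "num_title_words": len(title_words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Highly repetitive pages compress well, giving a large ratio.
        "compression_ratio": len(raw) / len(compressed) if raw else 0.0,
    }


# Hypothetical inputs: a keyword-stuffed page and an ordinary sentence.
stuffed = content_features("cheap cameras", "buy cheap cameras " * 50)
normal = content_features(
    "Camera review",
    "This review compares sensor size, autofocus speed and battery life.",
)
print(stuffed["compression_ratio"], normal["compression_ratio"])
```

A page built by repeating the same phrase compresses far better than ordinary prose, which is the intuition behind using compressibility as a spam signal.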

Classification model to detect spam

Given the training set DS, we generate N training sets by sampling n random items with replacement. For each of the N training sets, we now create a classifier, thus obtaining N classifiers. In order to classify a page, we have each of the N classifiers provide a class prediction, which is considered as a vote for that particular class. The eventual class of the page is the class with the majority of the votes.
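The bagging procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the classifier used in the study: the base learner is a hypothetical one-feature threshold rule (`train_stump`), and the toy training set pairs a made-up numeric feature with a class label.

```python
import random
from collections import Counter


def train_stump(samples):
    """Pick the threshold t that best fits: predict 'spam' when x >= t."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({x for x, _ in samples}):
        correct = sum(
            ("spam" if x >= threshold else "non-spam") == label
            for x, label in samples
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagging_train(ds, num_classifiers, sample_size, rng):
    """Build N classifiers, each on n items drawn from DS with replacement."""
    classifiers = []
    for _ in range(num_classifiers):
        bootstrap = [rng.choice(ds) for _ in range(sample_size)]
        classifiers.append(train_stump(bootstrap))
    return classifiers


def bagging_predict(classifiers, x):
    """Each classifier casts one vote; the majority class wins."""
    votes = Counter("spam" if x >= t else "non-spam" for t in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical training set DS: (feature value, label).
DS = [(2, "non-spam"), (3, "non-spam"), (4, "non-spam"), (5, "non-spam"),
      (20, "spam"), (25, "spam"), (30, "spam"), (40, "spam")]

classifiers = bagging_train(DS, num_classifiers=11, sample_size=len(DS),
                            rng=random.Random(0))
print(bagging_predict(classifiers, 100), bagging_predict(classifiers, 1))
```

An odd number of classifiers avoids tied votes in the two-class case.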

Bagging & Boosting
Confusion matrix (rows: actual class, columns: predicted class)
                   Predicted spam   Predicted non-spam
Actual spam              A                  B
Actual non-spam          C                  D

Challenges in Web Information Retrieval Mehran Sahami Vibhu Mittal Shumeet Baluja Henry Rowley Google Inc.

Information Retrieval on the Web
Goal: identify which pages are of high quality and relevance to a user’s query.
–PageRank, HITS
Two Challenges
–Adversarial classification: detecting Web spamming
–Evaluating search results

PageRank
Assume four web pages: A, B, C and D.
The initial values of PageRank are equal: PR(A) = PR(B) = PR(C) = PR(D)
PageRank for any page u: $PR(u) = \sum_{v \in B_u} \frac{PR(v)}{N_v}$
–B_u = {v | v links to page u}
–N_v = the number of links from page v

PR(A) = PR(C)/1
PR(B) = PR(A)/2
PR(C) = PR(A)/2 + PR(B)/1 + PR(D)/1
PR(D) = 0
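The four equations above correspond to one update step of the simplified (undamped) PageRank rule. A minimal sketch follows, assuming the link structure implied by those equations (A links to B and C, B links to C, C links to A, D links to C) and assuming a uniform starting value of 0.25 per page, which the transcript does not state.

```python
def pagerank_step(ranks, out_links):
    """One update: each page v gives PR(v) / N_v to every page it links to."""
    new_ranks = {page: 0.0 for page in ranks}
    for v, targets in out_links.items():
        for u in targets:
            new_ranks[u] += ranks[v] / len(targets)
    return new_ranks


# Link structure inferred from the equations above (an assumption).
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

# Assumed uniform initial values.
ranks = {page: 0.25 for page in out_links}
ranks = pagerank_step(ranks, out_links)
print(ranks)
```

No page links to D, so its value drops to 0 after the first step, matching PR(D) = 0.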

Determining the relatedness of fragments of text
e.g.: –“Captain Kirk” & “Star Trek” are more similar than –“Captain Kirk” & “Fried Chicken”.
How to measure the closeness between two phrases: K(x,y) =

Retrieval of UseNet Articles at least 800 million documents

Retrieval of Images and Sounds non-textual “documents” –from digital still and video cameras, camera phones, audio recording devices, and mp3 music.