Published by Gwendolyn Doyle. Modified over 9 years ago.
1
WEB SPAM: A By-Product Of The Search Engine Era
Web Enhanced Information Management
Aniruddha Dutta
Department of Computer Science, Columbia University
2
Contents
- Model of the Web
- Definition of Web Spam
- History of Web Spam
- Types of Web Spam
- Countermeasures
- Conclusion
3
The World Wide Web
- Huge
- Distributed content creation, linking (no coordination)
- Structured databases, unstructured text, semi-structured data
- Content includes truth, lies, obsolete information, contradictions, …
4
Search Engines As Gateways
- Search has become the default gateway to the web
- Very high premium to appear on the first page of search results
  – e.g., e-commerce sites
  – advertising-driven sites
- This has important economic implications
5
Definition Of Web Spam
Web spam can be defined as any intentional activity by a human to generate an unreasonably favorable result or importance for a web page that naturally should not have the weight or significance associated with it [1]. This is also called spamming or spamdexing.
6
History Of Web Spam
1st-generation search engines (1990s)
- Web spam emerged in this era
- The technique came to be known as ‘Glittering Generalities’
2nd-generation search engines
- Neutralized Glittering Generalities
- Ranked pages according to their popularity
- Popularity was determined by the links pointing to the web page
- Spammers built link farms to circumvent it
3rd-generation search engines
- Use PageRank and the HITS algorithm to rank pages
- Spammers have found new ways as well!
7
Boosting Techniques
These are the spamming techniques that influence the ranking algorithm of the search engine. They can be classified into two main categories:
- Term Spamming: Manipulating the text of web pages in order to appear relevant to queries
- Link Spamming: Creating link structures that boost PageRank or hub and authority scores
8
Taxonomy For Boosting Techniques
9
Types Of Term Spamming
Body Spam: The spam terms are present in the body of the page. This is the simplest and most common technique in term spamming.
Title Spam: The spam terms are present in the title tag of the web page.
Meta Tag Spam: The spam terms appear in the meta tags of the web page.
e.g. <meta name="Flowers" content="buy, cheap, roses, lilly, daffodils, flower vase, pink rose">
10
Types Of Term Spamming
Anchor Text Spam: The spam terms appear in the anchor text of links found on the web page.
- The terms in anchor text are given more importance
- The words are indexed for both the target page and the source page
e.g. Flowers, cheap deals, rose, daffodils, flower vase
11
Types Of Term Spamming
URL Spam: Spam terms appear in the URL of web pages.
- Search engines sometimes parse the URL and use its terms to decide whether the page is relevant.
DNS Spam: Spammers set up a DNS server that resolves any hostname to a single domain.
Repetition: A term is repeated many times in a field of the web page to make the page rank well for a specific query.
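The repetition technique lends itself to a very simple check: measure how much of a text field is taken up by its single most frequent word. The snippet below is a minimal sketch of that idea, not a method described in the slides; the function name `top_term_share` and the sample strings are made up for illustration, and a real filter would need stop-word handling and a tuned cut-off.

```python
import re
from collections import Counter


def top_term_share(text):
    """Return (term, share) for the most frequent word in `text`.

    `share` is the fraction of all words accounted for by that term.
    Hypothetical helper for illustration only.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return None, 0.0
    term, count = Counter(words).most_common(1)[0]
    return term, count / len(words)


# Illustrative inputs (invented for this example).
normal = "We sell fresh roses and tulips and deliver them across the city every day"
stuffed = "cheap roses cheap roses buy cheap roses cheap cheap roses cheap deals"

for label, text in (("normal", normal), ("stuffed", stuffed)):
    term, share = top_term_share(text)
    print(f"{label}: top term {term!r} covers {share:.0%} of the words")
```

A field in which one term covers half of the words is a strong hint of repetition, while ordinary prose rarely concentrates that much weight on a single word.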
12
Types Of Term Spamming
Dumping: A large number of unrelated terms are put together in the fields of the web page.
- Helps the page match a wide variety of queries
Weaving: Content found on the web is duplicated and spam terms are inserted in between the copied content.
Phrase Stitching: Sentences from different sources are concatenated and placed in the fields of the web page.
e.g. His article is about forests as communities of trees. Naco is the world leader in Rain forest protection
13
Types Of Link Spamming
Outdegree: Spammers create web pages that have a large number of links pointing to well-known pages.
- Can be done easily by directory cloning
Indegree: Spammers create pages that have useful content but hidden links to spam pages.
- These pages are called honey pots
- Can be achieved by adding links in directory structures
- Link farms
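To see why link farms pay off, it helps to compute a popularity score with and without the farm. The following is a minimal sketch using plain power-iteration PageRank on a toy graph; the page names, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration, not values taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


# Toy web: three honest pages in a cycle, plus a spam page nobody links to.
web = {"a": ["b"], "b": ["c"], "c": ["a"], "spam": ["a"]}

# Same web with a hypothetical five-page link farm pointing at the spam page.
web_with_farm = dict(web)
for i in range(5):
    web_with_farm[f"farm{i}"] = ["spam"]

before = pagerank(web)["spam"]
after = pagerank(web_with_farm)["spam"]
print(f"spam page score without farm: {before:.4f}, with farm: {after:.4f}")
```

The spam page's score rises once the farm pages link to it, even though the farm pages themselves have no incoming links and the total score is shared among more pages.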
14
Hiding Techniques
Techniques to hide spam content on a web page:
- Content Hiding
- Cloaking
- Redirection
15
Types Of Hiding Techniques
Content Hiding: Spam content on the page is hidden by using
- Color schemes
- Images in place of anchor text
e.g.
- Using color for content hiding: spam text
- Using images in anchor text
16
Types Of Hiding Techniques
Cloaking: Send one version of the content to crawlers and a different version to users.
- Pages check the IP address of crawlers
- Pages check the User-Agent field in the HTTP request
17
Cloaking Example
HTTP request to the page:
GET / HTTP/1.1[CRLF]
Host: yahoo.com[CRLF]
Connection: close[CRLF]
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.9) Gecko/20061206 Firefox/1.5.0.9 Web-Sniffer/1.0.24[CRLF]
Referer: http://web-sniffer.net/[CRLF]
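Because cloaking depends on who is asking, one way to look for it is to request the same URL twice, once with a crawler-like User-Agent and once with a browser-like one, and compare what comes back. The sketch below covers only the comparison step and assumes the two page bodies have already been fetched; the function names, the Jaccard word-overlap measure and the 0.5 threshold are illustrative assumptions, not something the slides prescribe.

```python
import re


def jaccard(text_a, text_b):
    """Jaccard similarity between the word sets of two texts (0.0 to 1.0)."""
    a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(as_crawler, as_browser, threshold=0.5):
    """Flag a page whose crawler and browser versions share few words.

    `threshold` is an arbitrary illustrative cut-off.
    """
    return jaccard(as_crawler, as_browser) < threshold


# Invented page bodies standing in for two fetches of the same URL.
crawler_view = "fresh roses and daffodils delivered daily"
browser_view = "win money now at our online casino"

print(looks_cloaked(crawler_view, browser_view))
print(looks_cloaked(crawler_view, crawler_view))
```

Legitimate pages also vary between requests (ads, personalization, timestamps), so a low overlap is a signal for closer inspection rather than proof of cloaking.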
18
Cloaking Example
19
Types Of Hiding Techniques
Redirection: Web pages redirect to spam pages as soon as they are opened.
- Search engines index the normal page
- The user gets redirected to the spam page on opening it
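The slides do not say which redirection mechanism is used. One common mechanism is the HTML `meta` refresh tag, and the sketch below scans a page for it using Python's standard `html.parser` module. The class and function names are hypothetical, and script-based redirects are outside what this simple scan can see.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collect (delay, url) pairs from <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=http://example.com/".
        delay, _, rest = (attr.get("content") or "").partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append((delay.strip(), rest[4:].strip()))


def find_meta_refresh(html):
    """Return every meta-refresh redirect found in an HTML string."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.targets


# Invented page source for illustration.
page = (
    "<html><head>"
    '<meta http-equiv="refresh" content="0; url=http://spam.example/">'
    "</head><body>Flowers and gardening tips</body></html>"
)
print(find_meta_refresh(page))
```

A zero-second refresh to a different site is the pattern worth flagging; a crawler that indexes only the first response sees the normal page, which is the situation the slide describes.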
20
TrustRank
- Basic principle: approximate isolation
  – It is rare for a “good” page to point to a “bad” (spam) page
- Sample a set of “seed pages” from the web
- Set the trust of each trusted seed page to 1
- Propagate trust through links
- Each page gets a trust value between 0 and 1
- Use a threshold value and mark all pages below the trust threshold as spam
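The steps above can be sketched as a PageRank-style iteration in which the random jump always returns to the trusted seeds. This is a minimal illustration under stated assumptions rather than the published algorithm in full: the seed trust is normalized so that it sums to 1 (instead of 1 per seed), and the graph, the damping factor of 0.85 and the threshold of 0.01 are made-up values.

```python
def trust_rank(links, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust forward along links, starting from trusted seeds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    seed = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
            for p in pages}
    trust = dict(seed)
    for _ in range(iterations):
        # Every round, a fixed share of trust is re-injected at the seeds.
        new = {p: (1.0 - damping) * seed[p] for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * trust[page] / len(targets)
                for t in targets:
                    new[t] += share
        trust = new
    return trust


def below_threshold(scores, threshold):
    """Pages whose score is under the threshold (candidate spam)."""
    return {p for p, s in scores.items() if s < threshold}


# Toy graph: good pages link among themselves; spam links to good, not back.
links = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(links, trusted_seeds={"good1"})
print(sorted(below_threshold(trust, threshold=0.01)))
```

Because no good page links to the spam pages, no trust ever reaches them, which is the approximate isolation principle at work. Note that a spam page linking out to a good page gains nothing, since trust flows only in the direction of the links.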
21
Anti-Trust Approach
- Broadly based on the same “approximate isolation” principle
- This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves
- Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages
- A page can be classified as spam if its Anti-Trust Rank value is above a chosen threshold
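The reverse propagation described above can be sketched by building the reversed link graph and running the same kind of seeded iteration from known spam pages. As before, this is an illustrative sketch: the page names, the normalized seed vector, the damping factor of 0.85 and the 0.01 threshold are assumptions, not values from the slides.

```python
def anti_trust_rank(links, spam_seeds, damping=0.85, iterations=50):
    """Propagate anti-trust backwards along links, starting from spam seeds."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Reverse the graph: for each page, list the pages that link TO it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for t in targets:
            incoming[t].append(source)
    seed = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0)
            for p in pages}
    score = dict(seed)
    for _ in range(iterations):
        new = {p: (1.0 - damping) * seed[p] for p in pages}
        for page in pages:
            sources = incoming[page]
            if sources:
                # A page hands its anti-trust to the pages that point at it.
                share = damping * score[page] / len(sources)
                for s in sources:
                    new[s] += share
        score = new
    return score


# Toy graph: two farm pages point at a known spam page, which links to good1.
links = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "spam": ["good1"],
}
scores = anti_trust_rank(links, spam_seeds={"spam"})
flagged = {p for p, s in scores.items() if s > 0.01}
print(sorted(flagged))
```

The farm pages are flagged because they point to the spam seed, while the good pages stay at zero even though the spam page links to one of them: being linked from spam does not make a page suspicious under this scheme.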
22
Conclusion
- Web spam is a by-product of the search engine era
- Identifying the structure of web spam is the first step to fighting it
- Due to the inherent characteristics of the Web, it is difficult to eliminate web spam altogether
- Different spam detection techniques can be combined to detect spam more effectively
23
Thank you
References
[1] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005. http://citeseer.ist.psu.edu/article/gyongyi05web.html