Identifying Spam Web Pages Based on Content Similarity Sole Pera CS 653 – Term paper project.


Overview
• Introduction
• Previous Work
• Methodology
  o Word-similarity
  o Hidden Content
  o Phrase-similarity
• Initial Results
• Conclusion

Introduction
• Information is available on the Web. However, 14% of the Web consists of spam Web pages.
• Spam Web pages:
  o Web pages that receive an unjustifiably favorable relevance or high ranking, regardless of their true value.
  o Attempt to deceive a search engine's relevancy ranking algorithm.
• Serious retrieval problem:
  o Quality of Web search is affected.
  o Search engines' reputation is damaged.
  o Users' trust in the retrieval process is weakened.

Previous Work
• Content Analysis:
  o [Ntoulas et al.] introduce and combine several heuristics based on the content of a Web page (number of words in a page, average length of words, fraction of visible content).
• Link Analysis:
  o [Becchetti et al.] and [Benczur et al.] consider links to and from a given Web page in order to determine if it is spam.

Methodology
• Focus on the title and the body of a Web page in order to determine whether the page is spam:
  o In legitimate Web pages, the title and the body are closely related.
  o In spam Web pages, the title and the body are usually not related.

Methodology
• Computing the title-body similarity:
  o Word-correlation factors, computed using Wikipedia documents.
  o Degree of resemblance between t (a word in a title) and B (the body of a Web page).
  o Degree of similarity between the words in the title and the words in the body of a Web page.
• Status of a Web page, determined from the title-body similarity.
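The title-body similarity steps above can be sketched as follows. The transcript does not include the formulas from the slides, so this is only an illustration under stated assumptions: the correlation table, the use of the maximum to score one title word against the body, the averaging over title words, and the threshold of 0.5 are all hypothetical choices, not the author's definitions.

```python
from typing import Dict, FrozenSet, List

# Hypothetical word-correlation factors in [0, 1]. The slides compute theirs
# from Wikipedia documents; the values used with this sketch are made up.
CorrelationTable = Dict[FrozenSet[str], float]


def correlation(table: CorrelationTable, w1: str, w2: str) -> float:
    """Look up the correlation factor of two words (identical words score 1)."""
    if w1 == w2:
        return 1.0
    return table.get(frozenset((w1, w2)), 0.0)


def word_to_body(table: CorrelationTable, t: str, body: List[str]) -> float:
    """Resemblance between a title word t and the body B.

    Assumption: the strongest correlation with any body word is used.
    """
    return max((correlation(table, t, b) for b in body), default=0.0)


def title_body_similarity(table: CorrelationTable,
                          title: List[str], body: List[str]) -> float:
    """Similarity between title and body.

    Assumption: the per-word resemblances are averaged over the title words.
    """
    if not title:
        return 0.0
    return sum(word_to_body(table, t, body) for t in title) / len(title)


def is_spam(table: CorrelationTable, title: List[str], body: List[str],
            threshold: float = 0.5) -> bool:
    """Flag a page as spam when title and body are weakly related.

    The threshold is a placeholder, not a value taken from the slides.
    """
    return title_body_similarity(table, title, body) < threshold
```

For example, with a table that relates "cheap" and "discount" at 0.8, the title ["cheap", "flights"] scores high against a body about discount flights and scores 0 against a body about casinos, so only the second page would be flagged.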

Methodology
• Fraction of Hidden Content:
  o Proportion of markup content of a given Web page (spam Web pages tend to contain less markup than legitimate Web pages).
  o Threshold value to determine the status of a Web page.
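One way to measure the proportion of markup in a page is to compare the length of its text content with the length of the raw HTML. The sketch below uses Python's standard `html.parser` module. The character-based definition and the threshold of 0.3 are assumptions made for illustration; the slides' exact formula and threshold are not part of this transcript.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts the characters the parser reports as text (non-markup) data."""

    def __init__(self) -> None:
        super().__init__()
        self.text_chars = 0

    def handle_data(self, data: str) -> None:
        self.text_chars += len(data)


def markup_fraction(html: str) -> float:
    """Fraction of the raw page that is markup rather than text.

    Assumption: measured in characters, as 1 - text_length / total_length.
    """
    if not html:
        return 0.0
    counter = _TextCounter()
    counter.feed(html)
    counter.close()  # flush any text still buffered by the parser
    return max(0.0, 1.0 - counter.text_chars / len(html))


def is_spam_by_markup(html: str, threshold: float = 0.3) -> bool:
    """Flag pages with unusually little markup.

    The threshold is a placeholder, not a value taken from the slides.
    """
    return markup_fraction(html) < threshold
```

A page that is almost entirely repeated keywords inside a bare `<body>` gets a fraction close to 0 and is flagged, while a page whose text is wrapped in several nested elements is not.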

Methodology
• Phrase similarity value:
  o Use the Odds measure to determine the phrase-correlation factor (based on the word-correlation factor).
  o Phrase similarity threshold value.
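The odds of a probability p are p / (1 - p). The sketch below shows one possible way to turn word-correlation factors into a phrase-correlation factor with that measure. How the slides actually combine the word-level factors is not recoverable from this transcript, so the product over aligned word pairs is purely an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def odds(p: float) -> float:
    """Odds of a probability p: p / (1 - p), with infinity at p == 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return math.inf if p == 1.0 else p / (1.0 - p)


def phrase_correlation(word_factors: Sequence[float]) -> float:
    """Phrase-correlation factor of two phrases.

    word_factors holds the word-correlation factors of the aligned word
    pairs of the two phrases. Assumption: the factors are multiplied and
    the odds of that product are returned.
    """
    joint = math.prod(word_factors)
    return odds(joint)
```

Under this assumption, two phrases whose aligned words correlate at 0.5 and 1.0 have a joint factor of 0.5 and therefore a phrase-correlation factor of 1.0; a page would then be compared against a phrase similarity threshold in the same way as for word similarity.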

Overall Spam Detection Approach

Experimental Results
• WEBSPAM-UK2006: 77.9 million classified (spam, non-spam, borderline) Web pages.
• Accuracy and error rate, using phrase similarity.

Experimental Results
• Enhancement of the phrase similarity approach:
  o Method A: only phrase similarity.
  o Method B: phrase similarity as well as hidden content.

Experimental Results
• Our performance (in terms of F-Measure) with respect to other known spam-detection approaches.
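The evaluation measures named in these slides (accuracy, error rate, F-Measure) can be computed from the counts of correctly and incorrectly classified pages. The sketch below uses the standard definitions with "spam" as the positive class. The counts in the usage note are made up for illustration and are not results from the experiments.

```python
from typing import Dict


def evaluate(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard classification measures, with 'spam' as the positive class.

    tp: spam pages flagged as spam        fp: non-spam pages flagged as spam
    tn: non-spam pages left unflagged     fn: spam pages left unflagged
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one classified page is required")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

With hypothetical counts of tp=40, fp=10, tn=45, fn=5, accuracy is 85/100 = 0.85, the error rate is 0.15, and the F-Measure is 80/95, about 0.842.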

Conclusion
• By using the phrases (words) in the title and body of a Web page, as well as the fraction of hidden content, we achieve 92% accuracy.
• Computationally inexpensive: the approach can be incorporated into existing search engines to enhance Web searches.

Questions