Site Level Noise Removal for Search Engines André Luiz da Costa Carvalho Federal University of Amazonas, Brazil Paul-Alexandru Chirita L3S and University of Hannover, Germany Edleno Silva de Moura Federal University of Amazonas, Brazil Pável Calado IST/INESC-ID, Portugal Wolfgang Nejdl L3S and University of Hannover, Germany

Outline Introduction Proposed Noise Removal Techniques Experiments Practical Issues Conclusion and future work

Introduction Link analysis algorithms are a popular source of evidence for search engines. These algorithms analyze the Web’s link structure to assess the quality (or popularity) of web pages.

Introduction This strategy relies on treating links as votes for quality. But not every link is a true vote for quality; we call such links “noisy links”.

Examples Link exchanges between friends; tightly knit communities; navigational links; links between mirrored sites; web rings; spam.

Introduction In this work we propose methods to identify noisy links. We also evaluate the impact of removing the identified links.

Introduction Most previous work focuses on spam. We take a broader view, targeting all links that can be considered noisy. This broader focus allows our methods to have a greater impact on the database.

Introduction In this work, we propose methods based on site-level analysis, i.e., methods based on the relationships between sites instead of individual pages. Site-level analysis can expose new sources of evidence that are not visible at the page level. Previous work is based solely on page-level analysis.

Proposed Noise Removal Techniques Uni-Directional Mutual Site Reinforcement (UMSR); Bi-Directional Mutual Site Reinforcement (BMSR); Site Level Abnormal Support (SLAbS); Site Level Link Alliances (SLLA).

Site Level Mutual Reinforcement

Based on how strongly connected a pair of sites is. Assumption: –Sites with many links between them have a suspicious relationship. Examples: mirror sites, colleagues, sites from the same group.

Uni-Directional and Bi-Directional Uni-Directional (UMSR) –Counts the number of links between the two sites. Bi-Directional (BMSR) –Counts the number of link exchanges (reciprocal links) between pages of the two sites.

Site Level Mutual Reinforcement In this example, there are 3 link exchanges and a total of 9 links between this pair of sites.

Site Level Mutual Reinforcement After counting, we remove all links between site pairs whose link counts exceed a given threshold. The thresholds were set experimentally.
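A minimal sketch of how these counts and the removal step could look, assuming the link graph is a collection of (source URL, target URL) pairs; the helper names are illustrative, and the default thresholds below are the values reported later in the experiments, not part of the algorithm itself.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site(url):
    # Site = the host part of the URL, as adopted in the experiments.
    return urlparse(url).netloc

def msr_counts(links):
    """links: iterable of (source_page, target_page) URL pairs."""
    link_set = set(links)
    umsr = defaultdict(int)   # UMSR: inter-site links per unordered site pair
    bmsr = defaultdict(int)   # BMSR: link exchanges (reciprocal links) per site pair
    for src, dst in link_set:
        s, t = site(src), site(dst)
        if s == t:
            continue                      # only links between different sites are counted
        pair = tuple(sorted((s, t)))
        umsr[pair] += 1
        if (dst, src) in link_set:        # the two pages link to each other
            bmsr[pair] += 1               # each exchange is seen from both ends...
    bmsr = {p: c // 2 for p, c in bmsr.items()}   # ...so halve the count
    return umsr, bmsr

def remove_msr_noise(links, umsr_threshold=250, bmsr_threshold=2):
    """Drop every link between site pairs whose UMSR or BMSR count exceeds its threshold."""
    umsr, bmsr = msr_counts(links)
    suspicious = {p for p, c in umsr.items() if c > umsr_threshold}
    suspicious |= {p for p, c in bmsr.items() if c > bmsr_threshold}
    # Intra-site links are never flagged, since only inter-site pairs were counted.
    return [(u, v) for u, v in links
            if tuple(sorted((site(u), site(v)))) not in suspicious]
```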

Site Level Abnormal Support

Based on the following assumption: –The total number of links to a site (i.e., the sum of links to its pages) should not be strongly influenced by the links it receives from any single other site. Quality sites should be linked by many different sites.

Site Level Abnormal Support Instead of plain counting, we compute the percentage of a site's total incoming links that come from each other site. If this percentage is higher than a threshold, we remove all links between the pair of sites.

Site Level Abnormal Support For example, if site A has 100 incoming links and 10 of them come from site B, then B is responsible for 10% of the incoming links to A.

Site Level Abnormal Support Using percentages avoids some problems of the plain counting used by the Mutual Reinforcement methods. For instance, tightly knit communities whose sites have only a few links between themselves can still be detected.
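A sketch of the abnormal-support check under the same assumptions (site() maps a URL to its host); the 2% default matches the threshold reported later in the experiments, and the function name is illustrative.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site(url):
    return urlparse(url).netloc   # same notion of site as before: the URL's host

def slabs_suspicious_pairs(links, threshold=0.02):
    """Find (source site, target site) pairs where the source provides an
    abnormally large share of the target site's incoming inter-site links."""
    incoming_total = defaultdict(int)      # inter-site in-links per target site
    incoming_from = defaultdict(int)       # in-links per (source site, target site) pair
    for src, dst in set(links):
        s, t = site(src), site(dst)
        if s == t:
            continue
        incoming_total[t] += 1
        incoming_from[(s, t)] += 1
    # All links between a flagged pair of sites would then be removed.
    return {(s, t) for (s, t), c in incoming_from.items()
            if c / incoming_total[t] > threshold}
```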

Site Level Link Alliances

Assumption: –A web site is only as popular as the sites that link to it are diverse and independent. Sites linked by a tight community are not as popular as sites linked by a diverse set of sites.

Site Level Link Alliances The impact of such alliances on PageRank has been reported before in the literature, but no solution was proposed.

Site Level Link Alliances For each page, we want to know how interconnected the pages that point to it are, considering only links between pages on different sites. We call this tightness “susceptivity”.

Site Level Link Alliances The susceptivity of a page is, given the set of pages that link to it, the percentage of the links from this set that point to other pages in the same set.
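Read as a formula, with In(p) denoting the set of pages that link to p and counting only inter-site links (this is my reading of the slide, not necessarily the paper's exact notation):

```latex
\mathrm{susceptivity}(p) =
  \frac{\left|\{(u,v) : u, v \in In(p),\; u \to v,\; \mathrm{site}(u) \neq \mathrm{site}(v)\}\right|}
       {\left|\{(u,v) : u \in In(p),\; u \to v,\; \mathrm{site}(u) \neq \mathrm{site}(v)\}\right|}
```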

Site Level Link Alliances After computing the susceptivity, the incoming links of a page are downgraded by a factor of (1 - susceptivity). In PageRank, which was the baseline used to evaluate the methods, this downgrade was integrated directly into the algorithm.

Site Level Link Alliances At each iteration, the value removed from each downgraded link is redistributed uniformly among all pages, to ensure convergence.
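A minimal sketch of how the downgrade could be folded into a power-iteration PageRank, assuming a susceptivity score in [0, 1] has already been computed for every page; the damping factor, dangling-page handling, and fixed iteration count are standard choices of this sketch, not the authors' implementation.

```python
def pagerank_with_downgrade(pages, out_links, susceptivity, d=0.85, iters=50):
    """pages: list of page ids; out_links[p]: pages that p links to;
    susceptivity[q]: tightness score in [0, 1] of page q (the link target)."""
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - d) / n for p in pages}
        leaked = 0.0                              # rank mass removed by the downgrade
        for p in pages:
            targets = out_links.get(p, [])
            if not targets:
                leaked += d * rank[p]             # dangling page: give all its mass back
                continue
            share = d * rank[p] / len(targets)
            for q in targets:
                w = 1.0 - susceptivity[q]         # downgrade the incoming links of q
                new_rank[q] += share * w
                leaked += share * (1.0 - w)
        for p in pages:
            new_rank[p] += leaked / n             # redistribute downgraded mass uniformly
        rank = new_rank                           # total mass stays 1, so the iteration behaves like standard PageRank
    return rank
```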

Experiments

Experimental Setup –The performance of the methods was evaluated by the gain they produced in the PageRank algorithm. –The evaluation used the database of the TodoBR search engine, a collection of 12,020,513 pages connected by 139,402,345 links.

Experiments Experimental Setup –The queries used in the evaluation were extracted from the TodoBR log, which comprises 11,246,351 queries.

Experiments Experimental Setup –We divided the selected queries into two sets: Bookmark queries, in which a specific Web page is sought, and topic queries, in which users look for information on a given topic rather than a specific page.

Experiments Experimental Setup: –Each set was further divided into two subsets: Popular queries (the most popular bookmark/topic queries) and randomly selected queries. –Each bookmark subset contained 50 queries and each topic subset contained 30 queries.

Experiments Methodology –To process the queries, we selected the documents with a Boolean match of the query and sorted them by their PageRank scores. –Combinations with other sources of evidence were also tested and led to similar results, though with the gains smoothed.
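A sketch of this ranking step, assuming a hypothetical inverted_index mapping each term to the set of documents containing it and a dict of precomputed PageRank scores; both names are placeholders, not the authors' system.

```python
def rank_results(query_terms, inverted_index, pagerank):
    """Boolean AND match over the query terms, then order by PageRank score."""
    matching = set.intersection(*(inverted_index[t] for t in query_terms))  # assumes a non-empty query
    return sorted(matching, key=lambda doc: pagerank[doc], reverse=True)
```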

Experiments Methodology: –Bookmark queries were evaluated automatically, while topic queries were judged by 14 people. –The judges labeled each result as relevant or highly relevant. –This led to two evaluations for each query: one considering both relevant and highly relevant results, and one considering only highly relevant results.

Experiments Methodology: –Bookmark queries were evaluated using the Mean Reciprocal Rank (MRR). –For bookmark queries we also used the mean position (MPOS) of the correct answers as a metric.

Experiments Methodology: –For topic queries, we evaluated Precision at 5, Precision at 10, and MAP (Mean Average Precision).
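For reference, the measures used above can be sketched as follows (standard textbook definitions, not the authors' evaluation scripts); ranking is an ordered list of result ids and relevant a set of relevant ids.

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, 0 if none; averaged over queries this gives MRR."""
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant (Precision at 5, Precision at 10, ...)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of precision@i over the ranks i holding relevant results; averaged over queries this gives MAP."""
    hits, score = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0
```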

Experiments Methodology: –We evaluated each method individually, and also evaluated all possible combinations of methods.

Experiments Algorithm-specific aspects: –The concept of site adopted in the experiments was the host part of the URL. –We used MRR to determine the best threshold for each algorithm; the best values were: UMSR: 250, BMSR: 2, SLAbS: 2%.

Experiments - Results For popular bookmark queries: [table of MRR, gain %, mean position (MPOS), and gain for All Links, UMSR, SLLA, and SLLA+BMSR+SLAbS; numeric values not preserved in this transcript]

Experiments - Results For random bookmark queries: [table of MRR, gain, mean position (MPOS), and gain for All Links, UMSR, SLLA, and SLLA+BMSR+SLAbS; numeric values not preserved in this transcript]

Experiments - Results For popular topic queries: [table of MAP over highly relevant results and MAP over all relevant results for All Links, UMSR, SLLA, and SLLA+BMSR+SLAbS; numeric values not preserved in this transcript]

Experiments - Results For random topic queries: [table of MAP over highly relevant results and MAP over all relevant results for All Links, UMSR, SLLA, and SLLA+BMSR+SLAbS; numeric values not preserved in this transcript]

Experiments - Results Relative gain for bookmark queries:

Experiments - Results Relative gain for topic queries:

Experiments Amount of removed links: [table of links detected and percentage of total links for UMSR, BMSR, SLAbS, UMSR+BMSR, and BMSR+SLAbS; numeric values not preserved in this transcript]

Practical Issues Complexity: –All proposed methods have a computational cost that grows proportionally to the number of pages in the collection and to the mean number of links per page.

Conclusions and Future Work The proposed methods obtained improvements of up to 26.98% in MRR and up to 59.16% in MAP. In addition, our algorithms identified 16.7% of the links in the database as noisy.

Conclusions and Future Work In future work, we will investigate: –The use of different weights for the identified links instead of removing them. –The impact on different link analysis algorithms.

Questions?