Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007.

Slides:



Advertisements
Similar presentations
Data Mining and the Web Susan Dumais Microsoft Research KDD97 Panel - Aug 17, 1997.
Advertisements

Sidra: a Flexible Distributed Indexing and Ranking Architecture for Web Search Miguel Costa, Mário J. Silva Universidade de Lisboa, Faculdade de Ciências,
Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.
Chapter 5: Introduction to Information Retrieval
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Information Retrieval in Practice
Search Engines and Information Retrieval
Architecture of a Search Engine
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Scalable and Distributed Similarity Search in Metric Spaces Michal Batko Claudio Gennaro Pavel Zezula.
INFO 624 Week 3 Retrieval System Evaluation
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Text-Based Content Search and Retrieval in ad hoc P2P Communities Francisco Matias Cuenca-Acuna Thu D. Nguyen
Information Retrieval
Distributed Information Retrieval Jamie Callan Carnegie Mellon University
Federated Search of Text Search Engines in Uncooperative Environments Luo Si Language Technology Institute School of Computer Science Carnegie Mellon University.
University of Kansas Data Discovery on the Information Highway Susan Gauch University of Kansas.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
What’s The Difference??  Subject Directory  Search Engine  Deep Web Search.
1/16 Final project: Web Page Classification By: Xiaodong Wang Yanhua Wang Haitang Wang University of Cincinnati.
An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
Search Engines and Information Retrieval Chapter 1.
Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.
CLEF 2005: Multilingual Retrieval by Combining Multiple Multilingual Ranked Lists Luo Si & Jamie Callan Language Technology Institute School of Computer.
Information Retrieval and Web Search Text properties (Note: some of the slides in this set have been adapted from the course taught by Prof. James Allan.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
Data Structures & Algorithms and The Internet: A different way of thinking.
A Search Engine Architecture Based on Collection Selection Diego Puppin University of Pisa, Italy Supervisors: D. Laforenza, M. Vanneschi.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
The Effect of Collection Organization and Query Locality on IR Performance 2003/07/28 Park,
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Document Clustering 文件分類 林頌堅 世新大學圖書資訊學系 Sung-Chien Lin Department of Library and Information Studies Shih-Hsin University.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Chapter 6: Information Retrieval and Web Search
Information retrieval 1 Boolean retrieval. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
Distributed Information Retrieval Server Ranking for Distributed Text Retrieval Systems on the Internet B. Yuwono and D. Lee Siemens TREC-4 Report: Further.
Parallel and Distributed Searching. Lecture Objectives Review Boolean Searching Indicate how Searches may be carried out in parallel Overview Distributed.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Concept-based P2P Search How to find more relevant documents Ingmar Weber Max-Planck-Institute for Computer Science Joint work with Holger Bast Torino,
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
A Logistic Regression Approach to Distributed IR Ray R. Larson : School of Information Management & Systems, University of California, Berkeley --
A System for Automatic Personalized Tracking of Scientific Literature on the Web Tzachi Perlstein Yael Nir.
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
1 CS 430: Information Discovery Lecture 5 Ranking.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur
3: Search & retrieval: Structures. The dog stopped attacking the cat, that lived in U.S.A. collection corpus database web d1…..d n docs processed term-doc.
Query Type Classification for Web Document Retrieval In-Ho Kang, GilChang Kim KAIST SIGIR 2003.
Federated text retrieval from uncooperative overlapped collections Milad Shokouhi, RMIT University, Melbourne, Australia Justin Zobel, RMIT University,
Information Retrieval in Practice
Why indexing? For efficient searching of a document
Search Engine Architecture
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
Text Based Information Retrieval
Data Mining Chapter 6 Search Engines
Introduction to Information Retrieval
Information Retrieval and Web Design
Discussion Class 9 Google.
Presentation transcript:

Document Clustering and Collection Selection Diego Puppin Web Mining,

Web Search Engines: Parallel Architecture To improve throughput, latency, res. quality A broker works as a unique interface Composed of several computing servers Queries are routed to a subset The broker collects and merges results

Doc-Partitioned Approach The document base is split among servers Each server indexes and manages queries for its own documents It knows all terms of some documents Better scalability of indexing/search Each server is independent Documents can be easily added/removed

Term-Partitioned Approach The dictionary is split among servers Each server stores the index for some terms It knows documents where its terms occur Potential for load reduction Poor load balancing (some work...)

Some considerations Every time you add/remove a doc You must update MANY servers With queries: only relevant servers are queried but... servers with hot terms are overloaded

How To Load Balance Put together related terms This minimizes the number of hit servers Try to put together group of documents with similar overall frequence Servers shouldn’t be overloaded Query logs could be used to predict

Multiple indexes Term-based index Query-vector or Bag-of-word Hot text index Titles, Anchor, Bold etc Link-based index It can find related pages etc Key-phrase index For some idioms

termslinkshot query terms referrer, related query terms

How To Doc-Partition? 1. Random doc assignment + Query broadcast 2. Collection selection for independent collections (meta-search) 3. Smart doc assignment + Collection selection 4. Random assignment + Random selection

1. Random + Broadcast Used by commercial WSEs No computing effort for doc clustering Very high scalability Low latency on each server Result collection and merging is the heaviest part

IR Core 1 idx IR Core 2 idx IR Core k idx t 1,t 2,…t q r 1,r 2,…r r query results Broker Distributed/Replicated Documents

2. Independent collections The WSE uses data from several sources It routes the query to the most authoritative collection(s) It collects the results according to independent ranking choices (HARD) Example: Biology, News, Law

Coll1Coll2Coll.k t 1,t 2,…t q r 1,r 2,…r r query results Broker

3. Assignment + Selection WSE creates document groups Each server holds one group The broker has a knowledge of group placement The selection strategy routes the query suitably

Doc cluster1 idx Doc cluster2 idx Doc cluster k idx t 1,t 2,…t q r 1,r 2,…r r query results Broker

4. Random + Random If data (pages, resources...) are replicated, interchangeable, hard to index Data are stored in the server that publishes them We query a few servers hoping to get something

Doc 1 idx Doc 2 idx Doc k idx t 1,t 2,…t q r 1,r 2,…r r query results Broker

CORI The Effect of Database Size Distribution on Resource Selection Algorithms, Luo Si and Jamie Callan Extends the concept of TF.IDF to collections

CORI Needs a deep collaboration from the collections: Data about terms, documents, size Unfeasible with independent collections Statistical sampling, Query-based sampling Term-based: no links, no anchors Very large footprint

Querylog-based Collection Selection Using Query Logs to Establish Vocabularies in Distributed Information Retrieval Milad Shokouhi; Justin Zobel; S.M.M. Tahaghoghi; Falk Scholer The collections are sampled using data from a query log Before: Queries over a dictionary (QBS)

Recall Number of found documents (not for web) Precision at X Number of relevant documents out of the first X results (X= 5, 10, 20, 100) Average precision Average of Precision for increasing X MAP (Mean Average Precision) The average over all queries

And now for something completely different!

Important features It can use any underlying WSE Links, snippets, anchors... Good or bad as the WSE it uses! Small footprint It can be added as another index

termslinkshotquery-vector query terms referrer, related query terms query mapping

Developments Query suggestions The system finds related queries Result grouping Documents are already organized into groups Query expansion Can find more complex queries still matching

Still missing Using the QV model as a full IR model: Is it possible to perform queries over this representation? New query terms cannot be found CORI should be used for unseen queries Better testing with topic shift

Possible seminars/projects Advanced collection selection Statistical sampling, Query-based sampling Partitioning a LARGE collection (1 TB) Load balancing for doc- and term-partitioning Topic shift Query log analysis