Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.

1. J. Qin et al. Building Domain Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method
2. G. Pant et al. Panorama: Extending Digital Libraries with Topical Crawlers (JCDL 2004)

Outline
- Problem Description
- Research Background
- Their Approaches
  - Designing Classifier
  - Enhancing Meta-Search
  - Identifying Communities in Collection
- Experiments
- Discussion

Problem Description
- Problem: collect domain-specific documents from the Web and manage the resulting literature collection.
- Topical crawlers (focused crawlers) are designed to collect domain-specific documents.
- How can the gaps between Web communities be bridged?

Digital Library vs. Search Engine
Digital Library
- Domain specific, serving literature study
- High quality
- Topical crawler + collection management
- Knowledge discovery
Search Engine
- General, serving web search
- High quantity
- General crawler + online retrieval
- Indexing, retrieval performance, etc.

Research Background
- Domain definition
- VSM, Naïve Bayes, SVM, etc.
- Expanding the training set, getting starting URLs
- BFS, best-first search, tree pruning, multiple starting URLs, tunneling
- TF-IDF, k-means, etc.
- PageRank, HITS, etc.

Why doesn't a general crawler with breadth-first search work?
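The contrast between breadth-first and best-first crawling can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy link graph `LINKS`, the relevance scores in `SCORE` (standing in for a topic classifier), and the rule that a link's priority is the score of the page it was found on. It is not the crawler from either paper.

```python
import heapq
from collections import deque
from itertools import count

# Hypothetical toy web graph: page -> outgoing links.
LINKS = {
    "seed": ["news", "lab"],
    "news": ["sports", "weather"],
    "lab": ["paper1", "paper2"],
    "sports": [], "weather": [], "paper1": [], "paper2": [],
}
# Hypothetical relevance scores, standing in for a topic classifier.
SCORE = {"seed": 0.5, "news": 0.1, "lab": 0.9, "sports": 0.0,
         "weather": 0.0, "paper1": 1.0, "paper2": 0.8}


def bfs_crawl(start, budget):
    """Visit pages in breadth-first order until the page budget is spent."""
    seen, order, queue = {start}, [], deque([start])
    while queue and len(order) < budget:
        page = queue.popleft()
        order.append(page)
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def best_first_crawl(start, budget):
    """Visit first the links found on the most relevant pages seen so far."""
    tie = count()  # insertion counter, breaks ties between equal priorities
    seen, order = {start}, []
    frontier = [(0.0, next(tie), start)]
    while frontier and len(order) < budget:
        _, _, page = heapq.heappop(frontier)
        order.append(page)
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                # heapq is a min-heap, so negate the parent page's score.
                heapq.heappush(frontier, (-SCORE[page], next(tie), nxt))
    return order


print(bfs_crawl("seed", 5))         # spends the budget on the off-topic branch
print(best_first_crawl("seed", 5))  # reaches both papers within the budget
```

With a budget of five pages, the breadth-first crawl in this toy graph exhausts its budget on the off-topic branch, while the best-first crawl reaches both on-topic papers.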

Web Communities

Design Classifier (Pant et al.)
Motivation: define the domain and distinguish relevant from non-relevant documents.
Approach:
- Query Google with the title and references to construct positive and negative example sets (the training set)
- Represent documents in the Vector Space Model, with TF-IDF term weights
- Use a Naïve Bayes classifier to estimate Pr(c+ | q), which is used for ranking
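A minimal sketch of the ranking idea, under stated assumptions: it is a multinomial Naïve Bayes classifier with Laplace smoothing over raw term counts, not the TF-IDF-weighted model described on the slide, and the function names `train` and `prob_relevant` and the example documents are hypothetical. It only shows how a posterior for the positive class can be computed and used as a relevance score.

```python
import math
from collections import Counter


def train(docs_pos, docs_neg):
    """Fit per-class log priors and Laplace-smoothed log likelihoods."""
    vocab = {w for d in docs_pos + docs_neg for w in d.split()}
    n_docs = len(docs_pos) + len(docs_neg)
    model = {}
    for label, docs in (("+", docs_pos), ("-", docs_neg)):
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        model[label] = {
            "prior": math.log(len(docs) / n_docs),
            "loglik": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                       for w in vocab},
        }
    return model


def prob_relevant(model, page):
    """Posterior probability of the positive class for a page's text."""
    logs = {}
    for label, m in model.items():
        # Words outside the training vocabulary are ignored.
        logs[label] = m["prior"] + sum(
            m["loglik"][w] for w in page.split() if w in m["loglik"])
    top = max(logs.values())  # subtract the max for numerical stability
    z = sum(math.exp(v - top) for v in logs.values())
    return math.exp(logs["+"] - top) / z


model = train(["focused crawler digital library"], ["football match score"])
print(prob_relevant(model, "digital library crawler"))
print(prob_relevant(model, "football score"))
```

A crawler could use such a score to order its frontier, visiting links from high-scoring pages first.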

Design Classifier (cont.) TF-IDF weighting:
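For reference, one widely used form of TF-IDF weighting is shown below. It is an assumption here, since the papers may use a different variant (for example, with length normalization or a smoothed logarithm). The symbols are the usual ones: tf_{t,d} is the frequency of term t in document d, N is the number of documents in the collection, and df_t is the number of documents containing t.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}
```

A term that occurs often in one document but in few documents overall therefore receives a high weight.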

Enhancing Meta-Search (Qin et al.)
Motivation: overcome the limitation of local search algorithms in crawling and bridge distributed Web communities.
Approach:
- Manually provide domain-specific queries
- Query a meta-search engine to get multiple starting URLs
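The seeding step can be sketched as follows. This is an illustration under assumptions, not the method from Qin et al.: the search engines are modeled as plain Python callables that return ranked URL lists (`engine_a` and `engine_b` are hypothetical stand-ins with made-up URLs), and results are merged by interleaving ranks and dropping duplicates.

```python
from itertools import zip_longest


def meta_search_seeds(queries, engines, limit=10):
    """Merge ranked results from several engines into one seed URL list."""
    seeds, seen = [], set()
    for query in queries:
        result_lists = [engine(query) for engine in engines]
        # Interleave by rank: all rank-1 results, then all rank-2 results, ...
        for rank_group in zip_longest(*result_lists):
            for url in rank_group:
                if url is not None and url not in seen:
                    seen.add(url)
                    seeds.append(url)
    return seeds[:limit]


# Hypothetical engines returning canned results for any query.
def engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]


def engine_b(query):
    return ["http://shared.example/x", "http://b.example/2"]


print(meta_search_seeds(["focused crawling"], [engine_a, engine_b]))
```

Starting the crawl from seeds drawn from several engines gives it entry points into communities that may not link to each other.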

Identifying Communities in Collection (Pant et al.)
Motivation: analyze the latent structures in the collection; summarize and represent potential communities.
Approach:
- Use k-means for content clustering
- Use HITS for structural clustering
- Label clusters by TF-IDF filtering
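As a rough illustration of the structural step, the sketch below computes HITS hub and authority scores by power iteration on a tiny link graph. The graph, the function name `hits`, and the fixed iteration count are assumptions, and the sketch covers only the scoring, not how the paper turns the scores into clusters.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a page -> out-links mapping."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to the node.
        auth = {n: sum(hub[s] for s, targets in links.items() if n in targets)
                for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of the pages the node links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: two hub pages pointing at two candidate authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Pages with high authority scores are natural candidates for representing a community, and pages with high hub scores for listing its members.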

Experiments (Qin et al.)
Experiment design:
- Compare with Google and a Domain Specific SE pages, 1/3 from meta-search method.
- Compare meta-search enhanced crawling with traditional one, by means of precision pages from baseline method.
- Experts define queries and judge results.

Experiments (Qin et al.) cont.
Experiment result 1:
- Their approach:
- General SE:
- Domain specific:
- The meta-search enhanced method is better than the general search engine and the traditional domain-specific search engine.
Experiment result 2:
- Experts rated results on a scale of 1-4
- Meta-search enhanced: 2.77
- Baseline: 2.51
- In the top 100 results from the meta-search enhanced collection: from meta-search: 3.22, rest: 2.61

Experiments (Pant et al.)
Experiment design:
- Test bed: from CiteSeer. 94 papers as initial documents.
- Use one (expanded by querying Google) for building the positive example set and the other 93 for building the negative example set.
- Compare with a BFS crawler.
- Harvest rate:
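One common way to compute a harvest rate is the fraction of crawled pages judged relevant. The sketch below assumes that definition, along with a hypothetical threshold of 0.5 on per-page relevance scores; the paper may define the measure differently (for example, as an average classifier score).

```python
def harvest_rate(relevance_scores, threshold=0.5):
    """Fraction of crawled pages whose relevance score meets the threshold."""
    if not relevance_scores:
        return 0.0
    relevant = sum(1 for s in relevance_scores if s >= threshold)
    return relevant / len(relevance_scores)


# Hypothetical scores for four crawled pages.
print(harvest_rate([0.9, 0.2, 0.7, 0.1]))
```

Tracking this value as the crawl proceeds shows whether a crawler stays on topic or drifts.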

Experiments (Pant et al.) cont. Experiment results: InterWeave: A middleware system for distributed shared states

Conclusions
- A system overview for building literature collections with topical web crawlers
- Classifier-enhanced best-first search performs better than breadth-first search.
- A meta-search enhanced topical crawler performs better than topical crawlers without meta-search.
- A clustering-based method to represent latent community structures in the collection

Discussion
Contributions of the two papers:
- Qin et al.: enhance crawling with meta-search to get multiple starting URLs
- Pant et al.: clarify and implement a sound system structure; propose a way to discover latent communities in the collection
Constraints:
- No significant theoretical contribution
- Experiments not convincing

Thanks!