6/16/2015. Recent Results in Automatic Web Resource Discovery, by Soumen Chakrabarti. Presentation by Cui Tao.

Introduction
Classical IR:
- Indexing a collection of documents
- Answering queries by returning a ranked list of relevant documents
Problems in retrieving online documents:
- Ambiguity
- Context sensitivity
- Synonymy
- Polysemy
- Large number of relevant Web pages

Introduction
Directory-based topic browsing:
- Tree-like structure
- Mostly maintained by human experts
- Advantages: exemplary, influential
- Disadvantages: slow, subjective, and noisy

Introduction
Standard crawlers and search engines:
- 1997: covered 35-40% of 340 million Web pages
- 1999: covered 18% of 800 million Web pages
- Cannot be used for maintaining generic portals or for automatic resource discovery

Introduction
Focused crawler:
- Selectively seeks out pages that are relevant to a pre-defined set of topics
- Preferred by experts and researchers
- Two modules:
  - Classifier: analyzes the text in, and the links around, a given Web page and automatically assigns it to suitable directories in a Web catalog
  - Distiller: identifies the centrality of crawled pages to determine visit priorities
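The crawl loop described on this slide can be sketched roughly as follows. This is a minimal illustration, not the system from the talk: the in-memory `web` dictionary stands in for real page fetching, `relevance` is a hypothetical stand-in for the classifier, and the distiller is not modeled (visit priority is simply the relevance of the page a link was found on).

```python
import heapq

def focused_crawl(web, seeds, relevance, max_pages=10, min_relevance=0.1):
    """web: dict url -> (text, outlinks); a toy stand-in for fetching pages."""
    # Max-priority frontier via negated scores; seeds get top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, outlinks = web[url]
        score = relevance(text)          # hypothetical classifier score in [0, 1]
        if score < min_relevance:
            continue                     # off-topic page: do not follow its links
        harvested.append((url, score))
        for link in outlinks:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return harvested

# Toy classifier (assumption): fraction of words that are topic terms.
TOPIC_TERMS = {"cycling", "bike"}

def toy_relevance(text):
    words = text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0

toy_web = {
    "seed": ("cycling bike routes", ["shop", "news"]),
    "shop": ("bike shop cycling gear", ["parts"]),
    "news": ("stock market news today", ["finance"]),
    "parts": ("bike parts", []),
    "finance": ("bank loans", []),
}
harvested = focused_crawl(toy_web, ["seed"], toy_relevance)
```

In this toy run the off-topic "news" page is fetched but not expanded, so "finance" is never visited.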

Distillation techniques
Google:
- Simulates a random walk on the Web
- Pages are ranked by pre-computed popularity and visitation rate
- Fast
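The random-walk idea can be illustrated with a small power-iteration sketch in the style of PageRank. The damping factor of 0.85, the iteration count, and the toy graph are assumptions for illustration; the slides do not give these details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Otherwise the surfer follows one outgoing link uniformly at random.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

Because the scores depend only on the link graph, they can be computed once offline, which is what makes query-time ranking fast.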

Distillation techniques
HITS (Hyperlink Induced Topic Search):
- Depends on a search engine
- Combines two scores:
  - Authorities: pages with useful information about a topic
  - Hubs: pages that contain many links to pages with useful information on the topic
- Query dependent and slow
- May lead to topic contamination or drift
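The mutual reinforcement between hubs and authorities can be sketched as an iterative update. This is a minimal version under stated assumptions: the link graph is given directly as a dictionary (in HITS it would come from a search engine's results and their neighbors), and the iteration count is arbitrary.

```python
import math

def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

toy_graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy_graph)
```

In the toy graph, "a1" ends up as the strongest authority because all three hubs point to it, and "h1" and "h2" outrank "h3" as hubs because they point to both authorities.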

Distillation techniques
ARC and CLEVER:
- ARC (Automatic Resource Compiler) is part of CLEVER
- The root set is expanded by 2 links instead of 1 link (it includes all pages within link-distance two of at least one page in the root set)
- Weights are assigned to the hyperlinks based on the match between the query and the text surrounding the hyperlink in the source document
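One simple way to turn "match between the query and the surrounding text" into a number is to count query-term occurrences in a window of text around the link. The sketch below assumes the window has already been extracted as a string, and the "1 + matches" form is an illustrative choice rather than something stated on the slide.

```python
def edge_weight(query, window_text):
    """Weight a hyperlink by how often query terms occur in the text around it."""
    query_terms = set(query.lower().split())
    window_terms = window_text.lower().split()
    # Every link gets a base weight of 1; each query-term occurrence adds 1.
    return 1 + sum(1 for term in window_terms if term in query_terms)

strong = edge_weight("power supply", "cheap switch mode power supply units")
weak = edge_weight("power supply", "click here")
```

These weights would then scale each link's contribution in the hub and authority updates, so that links whose surrounding text matches the query count for more.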

Distillation techniques
Outlier filtering:
- Computes relevance weights for pages using the Vector Space Model
- All pages whose weights fall below a threshold are pruned
- Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination
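A rough sketch of this pruning step, assuming raw term-frequency vectors, cosine similarity as the relevance weight, and a hypothetical threshold of 0.2 (the slide does not specify the weighting scheme or the threshold):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_outliers(pages, topic_text, threshold=0.2):
    """pages: dict url -> text. Keep pages whose similarity to the topic meets the threshold."""
    topic = Counter(topic_text.lower().split())
    kept = {}
    for url, text in pages.items():
        score = cosine(Counter(text.lower().split()), topic)
        if score >= threshold:
            kept[url] = score
    return kept

toy_pages = {
    "p1": "switch mode power supply design",
    "p2": "parcel delivery tracking service",
}
kept = prune_outliers(toy_pages, "power supply switch mode")
```

Only the surviving pages would then take part in the hub and authority computation.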

Topic distillation vs. Resource discovery
Topic distillation:
- Depends on large, comprehensive Web crawls and indices (post-processing)
- Can it be used to generate a Web taxonomy?
  - Set a keyword query for each node in the taxonomy
  - Run a distillation program
  - Simple, but has some problems

Topic distillation vs. Resource discovery
The Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies
Query: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel
To match the directory-based browsing quality of Yahoo!, queries needed 7.03 terms and 4.34 operators, versus 2.35 terms and 0.41 operators for Alta Vista queries
Problems:
- Constructing the query involves trial and error and complicated thought
- Example query: "North American telecommunication companies"

Topic distillation vs. Resource discovery
Problems:
- Contamination
- Stop-sites: not automatic
- Term weighting
- Edge weighting: no precise algorithm to set the weights
Topic distillation by itself is not enough for resource discovery

Hypertext classification: learning from examples
- Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result
- The contents of the given examples and their neighbors provide a way to compute the decision boundary for classification
- Nearest-neighbor (NN), Bayesian, and support vector classifiers
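As one concrete instance of learning from example pages, here is a minimal text-only naive Bayes classifier with add-one smoothing. It is a sketch of the Bayesian option named on the slide, not the classifier used in the work being presented; the training examples are invented, and link-based features are left out.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (topic, text) pairs supplied as training pages."""
    word_counts = defaultdict(Counter)   # topic -> word frequencies
    doc_counts = Counter()               # topic -> number of example pages
    vocab = set()
    for topic, text in examples:
        words = text.lower().split()
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the topic with the highest log-posterior for the given text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in sorted(doc_counts):
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)   # log prior
        for word in text.lower().split():
            if word in vocab:                               # ignore unseen words
                # Add-one (Laplace) smoothed word likelihood.
                score += math.log(
                    (word_counts[topic][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

model = train_naive_bayes([
    ("cycling", "bike race cycling tour"),
    ("cycling", "mountain bike trail"),
    ("finance", "stock market bank"),
    ("finance", "bank loan interest rate"),
])
label_a = classify(model, "bike trail race")
label_b = classify(model, "bank interest")
```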

Hypertext classification
Link-based features are important:
- Circular topic influence: the topic of a page influences both its text and the topics of its neighboring pages
- Knowledge of the topics in the linked vicinity provides clues about the test document's topic
- Bibliometric: more general than the simple linear endorsement model used in topic distillation

Putting it together for resource discovery

Conclusion
- Emphasized the importance of scalable automatic resource discovery
- Argued that common search engines are not adequate for resource discovery
- Introduced the recently invented focused crawling system

Future Work
- How can the training examples be derived automatically?
- How can the output of a focused crawler be personalized for individual users?