(C) 2003, The University of Michigan. Information Retrieval, Handout #7, March 24, 2003.

Slide 2: Course Information
- Instructor: Dragomir R. Radev
- Office: 3080, West Hall Connector
- Phone: (734)
- Office hours: M&F
- Course page:
- Class meets on Mondays, 1-4 PM in 409 West Hall

Slide 3: Schedule
Readings for 03/31:
- Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
- Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
- Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", WWW 2002

Slide 4: Schedule
- March 24
  - The link-content hypothesis
  - XML retrieval
- March 31
  - Information extraction
  - Language reuse
- April 7
  - Language modeling for IR
  - The Lemur system

Slide 5: Schedule
- HW3 assigned 03/24
- HW3 due 04/07
- Final projects due 04/11
- Final project presentations 04/14
- Final exam 04/ essay questions, 2-3 problems

Slide 6: The link-content hypothesis

Slide 7: Web structure
Kleinberg and Lawrence, "The structure of the Web", Science

Slide 8: Web structure
- links on average
- The fraction of pages with n in-links is approximately $n^{-\gamma}$ for $\gamma \approx 2.1$
- Kleinberg/Lawrence: 100,000 coherent communities (e.g., people concerned with oil spills off the coast of Japan)
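
To get a feel for the in-link power law on this slide, the following minimal sketch normalizes n^(-2.1) over a finite range and compares how common pages with 1 in-link are relative to pages with 10 in-links. The function name `inlink_fraction` and the cutoff `max_n` are assumptions made for illustration; they are not part of the original slides.

```python
def inlink_fraction(n, gamma=2.1, max_n=100_000):
    """Fraction of pages with exactly n in-links under a truncated power law.

    Assumption: the distribution is proportional to n**(-gamma) and is
    normalized over n = 1..max_n (the cutoff is chosen for illustration).
    """
    norm = sum(k ** -gamma for k in range(1, max_n + 1))
    return n ** -gamma / norm


# Pages with a single in-link vs. pages with ten in-links.
ratio = inlink_fraction(1) / inlink_fraction(10)
print(f"1 in-link is about {ratio:.1f}x as common as 10 in-links")
```

The ratio depends only on the exponent, not on the cutoff, because the normalization constant cancels.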

Slide 9: Topical locality [Davison 00]
- Most web pages are linked to others with related content; this helps users navigate the Web.
- Presence of topical locality: important for building focused crawlers.
- Traditionally, search engines only indexed titles and/or the first few lines of each document. Now, they index all links.
- "More evil than Satan himself"

Slide 10: Experimental design
- Local crawl of 100,000 pages
- Starts from HotBot and AltaVista
- Biased towards English-language pages
- From each page, retrieve one outgoing link per page.

Slide 11: TFIDF cosine similarity
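
The formula on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of one common TFIDF cosine similarity, assuming whitespace tokenization, raw term counts for TF, and log(N/df) for IDF. These weighting choices, the helper names, and the toy pages are assumptions, and the exact variant used in the lecture may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy pages (hypothetical): two on a related topic, one unrelated.
pages = [
    "focused crawler follows topical links",
    "topical crawler follows links on one topic",
    "recipe for chocolate cake",
]
vecs = tfidf_vectors(pages)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(sim_related > sim_unrelated)
```

With these toy pages the related pair scores higher than the unrelated pair, which shares no terms and therefore scores zero.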

Slide 12: Other metrics
- Query-document overlap
- Query term probability

Slide 13: Experimental results
- 100,000 URLs but only 89,891 retrievable
- An additional 111,107 URLs: two children per initial page (561), etc.
- 18% top-level pages
- 50% .com, 27% .edu

Slide 14: Textual similarity
TFIDF similarity:
- 0.31 same domain
- 0.23 linked pages
- 0.19 sibling
- 0.02 random

Slide 15: Structure and content [Menczer 01]
- Cluster hypothesis (van Rijsbergen 79)
- Link-cluster conjecture (Menczer): preservation of semantics across links

Slide 16: Experimental design
- Open Directory Project (dmoz.org)
- 896,233 URLs from 97,614 topics
- 150,000 URLs from 47,174 topics
- 10,000 from each of the 15 top-level branches

Slide 17: Measures of similarity
- Cosine
- Link similarity
- Semantic similarity
[Figure: topics c1 and c2 and their lca]
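
The slide names these measures without defining them. The sketch below shows one plausible reading, under stated assumptions: link similarity as the Jaccard overlap of two pages' link neighborhoods, and "lca" as the lowest common ancestor of two topics in a directory tree, found here as the longest shared prefix of slash-separated topic paths. Both definitions, the function names, and the example paths are assumptions and are not taken from the slides.

```python
def link_similarity(neighbors_p, neighbors_q):
    """Jaccard overlap of two sets of linked URLs (assumed definition)."""
    union = neighbors_p | neighbors_q
    if not union:
        return 0.0
    return len(neighbors_p & neighbors_q) / len(union)


def lowest_common_ancestor(topic_a, topic_b):
    """Longest shared prefix of two slash-separated topic paths."""
    shared = []
    for a, b in zip(topic_a.split("/"), topic_b.split("/")):
        if a != b:
            break
        shared.append(a)
    return "/".join(shared)


# Hypothetical neighborhoods and directory paths, for illustration only.
overlap = link_similarity({"a", "b", "c"}, {"b", "c", "d"})
lca = lowest_common_ancestor("Top/Science/Physics", "Top/Science/Math")
print(overlap, lca)
```

A semantic similarity score would then be some function of how deep the lca sits relative to c1 and c2; the exact formula is left open here because the transcript does not give it.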

Slide 18: Correlations between similarities
- Over $3.84 \times 10^9$ pairs
- Highest for News and Home (correlation > 0.2)
- Lowest for Arts and Games (correlation < 0.05)

Slide 19: Fit
$\alpha_1 = 1.8$, $\alpha_2 = 0.6$

Slide 20: Document closures for Q&A
[Figure labels: capital, P, LP, Madrid, spain, capital]

Slide 21: Document closures for IR
[Figure labels: Physics, P, LP, Physics Department, University of Michigan]

Slide 22: The perltree experiments
- 23.6% of the Excite log (2.5 M queries)
  - 60% have both words in WordNet
  - 27% have one word in WordNet
  - 13% have no words in WordNet
- 200 queries from the log
- 200 random queries

Slide 23: Two-word queries
jimi SAT seats david caesar poker cruise yellow science Tishara trim yankee witnesses naked swaybar cheats rides Precious drugs university Clock engines metal choreography anthony swinging psychoanalysis webdesign pic lens toys online speech therapy Malcolm McDowell cellular accessories migrant farmworkers witch tv davis instruments Adult Games chichen itza freighter Cruises used motorcycles feng shui revolucion mexicana zeebrugee belgium electronic greetings

Slide 24: Query analysis
- Words:
  - Familiarity
  - Ambiguity
  - IDF
- Queries:
  - GoogleSize
  - SemDist
  - DistribSim

Slide 25: Query analysis
Columns: Fam1, Fam2, Amb1, Amb2, IDF1, IDF2, Gsize, SemD, DistS
Rows: Excite (E), Random (R)

Slide 26: Link-based language models
- WT2g corpus
- 247,491 pages
- 3,118,248 links
- 948,036 unique words

Slide 28: Procedure
Given a query q1 q2:
- Get top 50 hits from AltaVista (A)
- Extract links that contain q1 or q2
- Get pages that are linked (B)
- Extract links from A ∪ B that point to A ∪ B
- Index A ∪ B using glimpse
- Compute link fertility
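
The set-building steps of this procedure can be sketched on a toy in-memory link graph. Everything here is hypothetical: the function name `expand_and_link`, the `anchors` data structure, and the page ids stand in for real search results and crawled pages. Treating B as only the newly discovered pages is also an assumption. The glimpse indexing and link fertility steps are omitted because the transcript does not define them.

```python
def expand_and_link(top_hits, anchors, query_terms):
    """Build A, B, and the links internal to A ∪ B.

    top_hits:    list of page ids returned for the query (set A).
    anchors:     dict mapping page id -> list of (anchor_text, target_id).
    query_terms: the query words q1, q2.
    """
    terms = {t.lower() for t in query_terms}
    a_set = set(top_hits)

    # B: pages reached from A through links whose anchor text has a query term.
    b_set = set()
    for page in a_set:
        for text, target in anchors.get(page, []):
            if terms & set(text.lower().split()):
                b_set.add(target)
    b_set -= a_set  # assumption: B holds only pages not already in A

    # Links from A ∪ B that point back into A ∪ B.
    closure = a_set | b_set
    internal = {
        (src, dst)
        for src in closure
        for _, dst in anchors.get(src, [])
        if dst in closure
    }
    return a_set, b_set, internal


# Toy link graph (hypothetical) for the query "feng shui".
anchors = {
    "a1": [("feng shui basics", "b1"), ("contact us", "x1")],
    "a2": [("shui garden tips", "a1"), ("weather", "x2")],
    "b1": [("home", "a1")],
}
A, B, links = expand_and_link(["a1", "a2"], anchors, ["feng", "shui"])
print(sorted(B), sorted(links))
```

In a real run, A would come from a search engine's top 50 results and `anchors` from fetching and parsing those pages.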

Slide 29: Results
- New links pointing to pages that were not in the AltaVista top 50: E = +11.7%, R = +8.9%
- Improvements higher for:
  - rarer words
  - lower distributional similarity
  - lower semantic distance

Slide 30: Topic distillation [Chakrabarti et al. 01]
- Topic drift
- Returning snippets rather than full documents
- Clique attacks (