Information Retrieval (4) Prof. Dragomir R. Radev


IR Winter 2010 … 7. Approximate string matching …

Levenshtein edit distance
Examples:
– Theatre -> theater
– Ghaddafi -> Qadafi
– Computer -> counter
Edit distance (inserts, deletes, substitutions)
– Edit transcript
Done through dynamic programming

Recurrence relation
Three dependencies
– D(i,0) = i
– D(0,j) = j
– D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
Simple edit distance:
– t(i,j) = 0 if S1(i) = S2(j), and 1 otherwise
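The recurrence can be sketched directly as a table-filling routine. This is a minimal illustration, not code from the course: the function name `edit_distance` is an assumption, and the comparison is case-sensitive, so inputs are lowercased by the caller if needed.

```python
def edit_distance(s1, s2):
    """Simple (unit-cost) Levenshtein distance between s1 and s2."""
    n, m = len(s1), len(s2)
    # d[i][j] = distance between the first i chars of s1 and first j of s2
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i          # D(i,0) = i: delete i characters
    for j in range(1, m + 1):
        d[0][j] = j          # D(0,j) = j: insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,      # delete from s1
                d[i][j - 1] + 1,      # insert into s1
                d[i - 1][j - 1] + t,  # match or substitute
            )
    return d[n][m]


print(edit_distance("theatre", "theater"))
print(edit_distance("ghaddafi", "qadafi"))
print(edit_distance("computer", "counter"))
```

The table has (n+1) x (m+1) cells and each cell is computed in constant time, so the running time is proportional to the product of the two string lengths.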

Example (Gusfield 1997)
[Dynamic programming table: columns W R I T E R S, rows V I N T N E R; first column initialized to D(i,0) = i, i.e. 1 through 7]

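Example (cont’d) (Gusfield 1997)
[Dynamic programming table, partially filled: columns W R I T E R S, rows V I N T N E R; row T (i = 4) reads 4, 4, 4, 4 followed by * marking the next cell to compute; rows N, E, R hold only D(i,0) = 5, 6, 7]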

Tracebacks (Gusfield 1997)
[Dynamic programming table for WRITERS vs. VINTNER, as on the previous slide, annotated with traceback pointers]
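A traceback recovers an edit transcript by walking from the bottom-right cell of the table back to the top-left, at each step moving to a neighbor that could have produced the current value. The sketch below is illustrative only: the function name `edit_transcript`, the operation letters (M = match, R = replace, D = delete from the first string, I = insert into the first string), and the tie-breaking order (diagonal first) are assumptions. Several optimal transcripts usually exist; this returns just one.

```python
def edit_transcript(s1, s2):
    """Return (distance, transcript) turning s1 into s2 with unit costs."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + t)

    # Walk back from (n, m) to (0, 0), preferring the diagonal on ties.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("D")   # character of s1 is deleted
            i -= 1
        else:
            ops.append("I")   # character of s2 is inserted
            j -= 1
    return d[n][m], "".join(reversed(ops))


distance, transcript = edit_transcript("vintner", "writers")
print(distance, transcript)
```

The number of non-M letters in the transcript equals the edit distance, which gives a quick sanity check on any traceback implementation.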

Weighted edit distance
Used to emphasize the relative cost of different edit operations
Useful in bioinformatics
– Homology information
– BLAST
– Blosum
– http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
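One way to sketch weighted edit distance is to keep the same dynamic program and replace the unit costs with parameters. Everything named here is hypothetical: `weighted_edit_distance`, its `ins_cost`, `del_cost`, and `sub_cost` parameters, and the toy `vowel_friendly` cost function are illustrations only and are not taken from BLAST or the Blosum matrices.

```python
def weighted_edit_distance(s1, s2, ins_cost=1.0, del_cost=1.0, sub_cost=None):
    """Edit distance with configurable operation costs.

    sub_cost(a, b) returns the cost of replacing character a with b;
    by default it is 0 for identical characters and 1 otherwise.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    n, m = len(s1), len(s2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return d[n][m]


VOWELS = set("aeiou")

def vowel_friendly(a, b):
    """Toy cost function: swapping one vowel for another is cheap."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


print(weighted_edit_distance("cat", "cot", sub_cost=vowel_friendly))
print(weighted_edit_distance("cat", "cut"))
```

If a substitution costs more than a deletion plus an insertion, the minimum never uses it, so the substitution cost only matters when it is below that sum.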

Links
Web sites:
–
–
Demo:
– /home/cs6998/tools/editDistance/dp/l.pl theater theatre
– http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Other methods
Cosine
Generation probabilities (language modeling)
(exp)KL-divergence
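As an illustration of the cosine option, two strings can be compared by the angle between their feature vectors. The slide does not say which features are used; representing each string by its character bigram counts is an assumption made here, and the helper names `char_ngrams` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(s, n=2):
    """Count the overlapping character n-grams of s (bigrams by default)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


similarity = cosine(char_ngrams("theatre"), char_ngrams("theater"))
print(round(similarity, 3))
```

Unlike edit distance, this is a similarity (higher means closer) and it ignores where in the string the shared bigrams occur.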

IR Winter 2010 … 8. Query expansion Relevance feedback …

Query expansion

Corpus-based: mine query logs
NLP-based
Vector-space relevance feedback

Relevance feedback
Problem: initial query may not be the most appropriate to satisfy a given information need.
Idea: modify the original query so that it gets closer to the right documents in the vector space.

Relevance feedback
Automatic
Manual
Method: identifying feedback terms
Q’ = a_1 Q + a_2 R - a_3 N
Often a_1 = 1, a_2 = 1/|R| and a_3 = 1/|N|

Example
Q = “safety minivans”
D_1 = “car safety minivans tests injury statistics” - relevant
D_2 = “liability tests safety” - relevant
D_3 = “car passengers injury reviews” - non-relevant
R = ?
N = ?
Q’ = ?
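The example can be worked through in a few lines of code. This sketch makes several assumptions that the slides do not state: documents and the query are represented by raw term counts, R and N are taken to be the sums of the relevant and non-relevant document vectors, and the weights default to a_1 = 1, a_2 = 1/|R|, a_3 = 1/|N|. The names `term_vector` and `rocchio` are hypothetical.

```python
from collections import Counter


def term_vector(text):
    """Raw term-count vector for a whitespace-tokenized string."""
    return Counter(text.lower().split())


def rocchio(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Compute Q' = a1*Q + a2*R - a3*N over sparse term vectors."""
    if a2 is None:
        a2 = 1.0 / len(relevant) if relevant else 0.0
    if a3 is None:
        a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    new_query = {}
    for term, weight in query.items():
        new_query[term] = new_query.get(term, 0.0) + a1 * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + a2 * weight
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) - a3 * weight
    return new_query


q = term_vector("safety minivans")
d1 = term_vector("car safety minivans tests injury statistics")
d2 = term_vector("liability tests safety")
d3 = term_vector("car passengers injury reviews")

q_new = rocchio(q, relevant=[d1, d2], nonrelevant=[d3])
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight}")
```

Under these assumptions "safety" ends up with the highest weight (2.0), followed by "minivans" (1.5), while terms that occur mainly in the non-relevant document, such as "passengers" and "reviews", receive negative weights. In practice negative weights are often clipped to zero before the expanded query is run.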

Pseudo relevance feedback
Automatic query expansion
– Thesaurus-based expansion (e.g., using latent semantic indexing – later…)
– Distributional similarity
– Query log mining

Examples
Lexical semantics (Hypernymy):
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper

Examples (query logs)
Book: booksellers, bookmark, blue
Computer: sales, notebook, stores, shop
Fruit: recipes cake salad basket company
Games: online play gameboy free video
Politician: careers federal office history
Newspaper: online website college information
Schools: elementary high ranked yearbook
California: berkeley san francisco southern
French: embassy dictionary learn

[Otterbacher et al. HLT EMNLP 2005]

Final projects
Two formats:
– A software system that performs a specific search-engine related task. We will create a web page with all such code and make it available to the IR community.
– A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
Deliverables:
– System (code + documentation + examples) or Paper (+ code, data)
– Poster (to be presented in class)
– Web page that describes the project.

Readings
4: MRS15, MRS16
5: MRS17
6: MRS18, MRS19