CS798: Information Retrieval (Charlie Clarke)

CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating large collections of human-language data.

Housekeeping: Web page: Area: “Applications/Databases”. Meeting times: Mondays, 2:00-5:00, MC2036.

NLP, DB, ML, IR

Topics: 1. Basic techniques 2. Searching, browsing, ranking, retrieval 3. Indexing algorithms and data structures 4. Evaluation 5. Application areas

1. Basic Techniques: Text representation & tokenization; Inverted indices; Phrase searching example; Vector space model; Boolean retrieval; Simple proximity ranking; Test collections & evaluation.

2. Retrieval and Ranking: Probabilistic retrieval and Okapi BM25F; Language modeling; Divergence from randomness; Passage retrieval; Classification; Learning to rank; Implicit user feedback.

3. Indexing Algorithms and Data Structures: Index creation; Dynamic update; Index compression; Query processing; Query optimization.

4. Evaluation: Statistical foundations of evaluation; Measuring efficiency; Measuring effectiveness (recall/precision, NDCG, other measures); Building a test collection.
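
The effectiveness measures named on this slide can be sketched in a few lines. This is a minimal illustration, not the course's evaluation tooling: the function names are hypothetical, precision is computed against the cutoff `k`, and the NDCG variant shown uses a log2 rank discount with graded gains, which is one common formulation among several.

```python
import math

def precision_recall(ranked, relevant, k):
    """Precision and recall over the top-k results of a ranked list.

    ranked   -- list of document ids, best first
    relevant -- set of document ids judged relevant
    """
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranked, gain, k):
    """DCG of the top-k results, normalized by the DCG of an ideal ordering.

    gain -- dict mapping document id to a graded relevance value (assumed scale)
    """
    actual = [gain.get(doc, 0) for doc in ranked[:k]]
    ideal = sorted(gain.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(actual) / best if best > 0 else 0.0
```

For example, with `ranked = ["d3", "d1", "d7", "d2"]` and `relevant = {"d1", "d2", "d5"}`, `precision_recall(ranked, relevant, 4)` counts two hits, giving precision 2/4 and recall 2/3.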

5. Application Areas: Parallel retrieval architectures; Web search (link analysis/PageRank); XML retrieval; Filesystem search; Spam filtering.

Other Topics (student projects): Image/video/speech retrieval; Web spam; Cross- and multi-lingual IR; Clustering; Advertising/Recommendation; Distributed IR/Meta-search; Question answering; etc.

Resources Textbook (partial draft on Website): Büttcher, Clarke & Cormack. Information Retrieval: Data Structures, Algorithms and Evaluation. (start reading ch. 1-3) Wumpus:

Grading: Short homework exercises from text (10%); a literature review based on a topic area selected by the student with the agreement of the instructor (30%); 30-minute presentation on your selected topic (20%); class project (40%), details coming up.

“Documents”: Documents are the basic units of retrieval in an IR system. In practice they might be Web pages, messages, LaTeX files, news articles, phone messages, etc. Update operations: add, delete, append(?), modify(?). Passages and XML elements are other possible units of retrieval.

Probability Ranking Principle If an IR system’s response to a query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its users will be maximized.
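
A minimal sketch of what the principle prescribes, assuming the system already has an estimated probability of relevance for each document (the `prob` dictionary and both function names are hypothetical). Sorting by decreasing probability maximizes the expected number of relevant documents in every top-k prefix, which is one simple way to see why such a ranking is optimal under this model.

```python
def rank_by_probability(prob):
    """Order document ids by decreasing estimated probability of relevance.

    Ties are broken by document id so the ordering is deterministic.
    """
    return sorted(prob, key=lambda doc: (-prob[doc], doc))

def expected_relevant_at_k(ranking, prob, k):
    """Expected number of relevant documents among the top k of a ranking."""
    return sum(prob[doc] for doc in ranking[:k])
```

Comparing `expected_relevant_at_k` for the sorted ranking against any other permutation of the same documents shows that no reordering does better at any cutoff.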

Evaluating IR systems: Efficiency vs. effectiveness; Manual evaluation (topic creation and judging; TREC (Text REtrieval Conference); Google has 10,000 human evaluators?); Evaluation through implicit user feedback; Specificity vs. exhaustivity.

shark attacks: Where do shark attacks occur in the world? Are there beaches or other areas that are particularly prone to shark attacks? Documents comparing areas and providing statistics are relevant. Documents describing shark attacks at a single location are not relevant.

Class Project: Wikipedia Search. Can we outperform Google on Wikipedia? Basic project: Build a search engine for Wikipedia (using any tools you can find). Ideas: PageRank, spelling, structure, element retrieval, summarization, external information, user interfaces.

Class Project: Evaluation Each student will create and judge n topics. The value of n depends on the number of students. (But workload stays the same.) Quantitative measure of effectiveness. Qualitative assessment of user interfaces. Volunteer needed to operate the judging interface (for credit).

Class Project: Organization You may work in groups (check with me). You may work individually (check with me). You may create and share tools with other students. You get the credit. (e.g. Volunteer needed to set up a class wiki.) Programming can’t be avoided, but can be minimized. ☺ Programming can also be maximized.

Class Project: Grading. Topic creation and judging: 10%. Other project work: 30%. You are responsible for submitting one experimental run for evaluation; other activities are up to you.

One line?

Tokenization: For English text, treat each string of alphanumeric characters as a token and number the tokens sequentially from the start of the text collection. For non-English text, the approach depends on the language (possible student projects). Other considerations: stemming, stopwords, etc.
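
The English-text rule on this slide can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, "alphanumeric" is taken to mean ASCII letters and digits, tokens are lowercased (the slide does not say so), and numbering starts at 0.

```python
import re

# Assumption: a token is a maximal run of ASCII letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")

def tokenize(documents):
    """Return (position, token) pairs for a collection of texts.

    Positions run sequentially across the whole collection rather than
    restarting at each document.
    """
    position = 0
    tokens = []
    for text in documents:
        for match in TOKEN.finditer(text):
            # Lowercasing is a common normalization, added here as an assumption.
            tokens.append((position, match.group().lower()))
            position += 1
    return tokens
```

Calling `tokenize(["Shark attacks!", "Where do sharks attack?"])` numbers the tokens 0 through 5, with "where" at position 2 because the count continues across the document boundary. Stemming and stopword removal would be applied as later steps on this output.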

Inverted Indices Basic data structure More next day…
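
As a preview of the data structure, here is a minimal in-memory sketch of a positional inverted index with a phrase lookup on top of it. It is not how Wumpus or any particular system implements its index; the function names are hypothetical, and the tokenizer (lowercased runs of ASCII letters and digits) is an assumption.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[A-Za-z0-9]+")

def build_index(text):
    """Map each term to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for position, match in enumerate(WORD.finditer(text)):
        index[match.group().lower()].append(position)
    return index

def phrase_positions(index, phrase):
    """Return the start positions at which the terms of `phrase` occur consecutively."""
    terms = [t.lower() for t in WORD.findall(phrase)]
    if not terms:
        return []
    # Position sets for the second and later terms, for fast membership tests.
    later = [set(index.get(term, ())) for term in terms[1:]]
    return [
        start
        for start in index.get(terms[0], [])
        if all(start + offset + 1 in positions for offset, positions in enumerate(later))
    ]
```

For the text "shark attacks occur where shark sightings and shark attacks are common", the postings list for "shark" is `[0, 4, 7]`, and `phrase_positions(index, "shark attacks")` keeps only the starts that are immediately followed by "attacks".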

Plan: Sept 17: Inverted indices (from Chapter 3); index construction/Wumpus (Stefan). Sept 24: Vector space model, Boolean retrieval, proximity; basic evaluation methods. October 1: Probabilistic retrieval, language modeling; start topic creation for class project. October 8: Web search.