Introduction to Information Retrieval


Introduction to Information Retrieval Hongning Wang CS@UVa

What is information retrieval? CS@UVa CS4501: Information Retrieval

Why information retrieval
- Information overload
  - “It refers to the difficulty a person can have understanding an issue and making decisions that can be caused by the presence of too much information.” (Wikipedia)

Why information retrieval
- Information overload
[Figure 1: Growth of Internet]
[Figure 2: Growth of WWW]

Why information retrieval
- Handling unstructured data
  - Structured data: a database system is a good choice
  - Unstructured data is more dominant
    - Text in Web documents or emails, image, audio, video…
    - “85 percent of all business information exists as unstructured data” (Merrill Lynch)
    - Unknown semantic meaning
[Figure: Total Enterprise Data Growth 2005-2015, IDC 2012]

Table 1: People in CS Department
ID   Name    Job
1    Jack    Professor
3    David   Staff
5    Tony    IT support

Why information retrieval
- An essential tool to deal with information overload
[Figure annotation: “You are here!”]

History of information retrieval
- Idea popularized in the pioneering article “As We May Think” by Vannevar Bush, 1945
  - “Wholly new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.”
  - “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility.”
  - -> WWW -> Search engine

Major research milestones
- Early days (late 1950s to 1960s): foundation of the field
  - Luhn’s work on automatic indexing
  - Cleverdon’s Cranfield evaluation methodology and index experiments
  - Salton’s early work on the SMART system and experiments
- 1970s-1980s: a large number of retrieval models
  - Vector space model
  - Probabilistic models
- 1990s: further development of retrieval models and new tasks
  - Language models
  - TREC evaluation
  - Web search
- 2000s-present: more applications, especially Web search and interactions with other fields
  - Learning to rank
  - Scalability (e.g., MapReduce)
  - Real-time search

History of information retrieval
- Catalyst
  - Academia: Text Retrieval Conference (TREC) in 1992
    - “Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies.”
    - “… about one-third of the improvement in web search engines from 1999 to 2009 is attributable to TREC. Those enhancements likely saved up to 3 billion hours of time using web search engines.”
    - It remains a major test bed for academic research in IR today

History of information retrieval
- Catalyst
  - Industry: web search engines
    - The WWW unleashed an explosion of published information and drove innovation in IR techniques
    - First web search engine: “Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that periodically mirrored these pages and rewrote them into a standard format.” (Sept 2, 1993)
    - Lycos (started at CMU) was launched and became a major commercial endeavor in 1994
    - Boom of the search engine industry: Magellan, Excite, Infoseek, Inktomi, Northern Light, AltaVista, Yahoo!, Google, and Bing

Major players in this game
[Figure: Global search engine market, desktop. Source: http://marketshare.hitslink.com/search-engine-market-share.aspx]

Major players in this game
[Figure: Global search engine market, mobile. Source: http://marketshare.hitslink.com/search-engine-market-share.aspx]

How to perform information retrieval
- Information retrieval when we did not have a computer

How to perform information retrieval
- Crawler and indexer
- Query parser
- Document analyzer
- Ranking model

How to perform information retrieval
[Diagram: IR system architecture. Labels: Repository, PARSING & INDEXING, Doc Rep, Query Rep, query, Ranking, User, SEARCH, results, APPLICATIONS, LEARNING, judgments, Evaluation, FEEDBACK]
We will cover: 1) Search engine architecture; 2) Retrieval models; 3) Retrieval evaluation; 4) Relevance feedback; 5) Link analysis; 6) Search applications.
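
A minimal Python sketch of how the indexing and ranking components named above might fit together. The toy corpus, the function names (`tokenize`, `build_index`, `search`), and the TF-IDF scoring are illustrative assumptions, not taken from the slides; a real engine would use a much richer document analyzer and ranking model.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (stand-in for a real document analyzer)."""
    return text.lower().split()

def build_index(docs):
    """Build a toy inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(query, index, num_docs):
    """Score documents by summed TF-IDF over the query terms, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Sort by descending score; break ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document collection.
docs = {
    1: "information retrieval is about search",
    2: "database systems store structured data",
    3: "web search engines rank web documents",
}
index = build_index(docs)
results = search("web search", index, len(docs))
print(results)
```

Document 3 ranks first here because it contains both query terms and the rarer term "web" carries more weight; document 2 shares no term with the query and is not returned at all.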

Core concepts in IR
- Query representation
  - Lexical gap: say vs. said
  - Semantic gap: ranking model vs. retrieval method
- Document representation
  - Specific data structure for efficient access
  - Lexical gap and semantic gap
- Retrieval model
  - Algorithms that find the most relevant documents for the given information need
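
To make the lexical gap concrete, here is a small Python sketch, assuming a hand-written lemma table (`LEMMAS`) and helper names (`normalize`, `matches`) that are purely hypothetical; a real system would rely on a stemmer or lemmatizer rather than a lookup of three words.

```python
# Hypothetical lemma table for illustration only.
LEMMAS = {"said": "say", "says": "say", "saying": "say"}

def normalize(text):
    """Lowercase, split on whitespace, and map known word forms to a shared lemma."""
    return [LEMMAS.get(token, token) for token in text.lower().split()]

def matches(query, document):
    """True when every normalized query term occurs in the normalized document."""
    doc_terms = set(normalize(document))
    return all(term in doc_terms for term in normalize(query))

document = "She said hello"

# Exact token matching misses the document: "say" is not literally present.
exact = "say" in document.lower().split()

# After normalization, query and document meet on the lemma "say".
normalized = matches("say", document)

print(exact, normalized)
```

The semantic gap ("ranking model" vs. "retrieval method") is harder: the two phrases share no word form at all, so normalization of this kind does not help.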

A glance at modern search engines
- In old times: Yahoo! (“Yet Another Hierarchical Officious/Obstreperous/Odiferous/Organized Oracle”)

A glance at modern search engines
- Modern time
  - Demand of understanding
  - Demand of convenience
  - Demand of efficiency
  - Demand of accuracy
  - Demand of diversity

IR is not just about web search
- Web search is one important area of information retrieval, but not all of it
- Information retrieval also includes
  - Recommendation
  - Question answering
  - Text mining
  - Online advertising
  - Enterprise search: web search + desktop search

Recap: what is information retrieval

Recap: why information retrieval
- Information overload
  - Too much information to process
- Handling unstructured data
  - Unknown semantic meaning

Recap: history of information retrieval
- “As We May Think” by Vannevar Bush, 1945
  - WWW and search engine
- Early days (late 1950s to 1960s): automatic indexing
- 1970s-1980s: retrieval models
- 1990s: TREC evaluation and Web search
- 2000s-present: more applications

Recap: IR architecture
[Diagram: IR system architecture. Labels: Repository, PARSING & INDEXING, Doc Rep, Query Rep, query, Ranking, User, SEARCH, results, APPLICATIONS, LEARNING, judgments, Evaluation, FEEDBACK]
We will cover: 1) Search engine architecture; 2) Retrieval models; 3) Retrieval evaluation; 4) Relevance feedback; 5) Link analysis; 6) Search applications.

Recap: IR is not just web search
- Recommendation: Netflix, Pandora
- Question answering: Wolfram Alpha
- Text mining: topic modeling, sentiment analysis
- Online advertisement: behavior targeting, monetization

Pop-up quiz
- Let $a = (1, 2, 3)$ and $b = (2, 3, -2)$. The inner product between $a$ and $b$ is
  (a) 0 (b) 1 (c) 2 (d) 3
- Let $A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$. What is $A^{-1}$?
  (a) $\begin{pmatrix} -1 & -2 \\ -2 & -1 \end{pmatrix}$ (b) $\begin{pmatrix} -\frac{1}{3} & \frac{2}{3} \\ \frac{2}{3} & -\frac{1}{3} \end{pmatrix}$ (c) $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ (d) $\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$

Pop-up quiz
- What is the expectation of a random variable drawn from the Gaussian distribution $N(0, 1)$?
  (a) 0 (b) 0.5 (c) 1 (d) 2
- A biased coin has $P(\text{head}) = 0.2$. In a sequence of 10 consecutive tosses, you have already got 9 tails. What is the probability of a head on the 10th toss?
  (a) 0 (b) 0.1 (c) 0.2 (d) $0.2 \times 0.8^{9}$

Pop-up quiz: answers
- Inner product of $a = (1, 2, 3)$ and $b = (2, 3, -2)$: (c) 2
- Inverse of $A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$: (b) $\begin{pmatrix} -\frac{1}{3} & \frac{2}{3} \\ \frac{2}{3} & -\frac{1}{3} \end{pmatrix}$
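
The two linear-algebra answers can be checked with a few lines of Python. This is only a verification sketch; the helper names `inner` and `inverse_2x2` are made up for this example, and exact fractions are used so the result is not blurred by floating-point rounding.

```python
from fractions import Fraction

def inner(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def inverse_2x2(m):
    """Inverse of a 2x2 matrix [[p, q], [r, s]] via the adjugate formula."""
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[s / det, -q / det], [-r / det, p / det]]

dot = inner((1, 2, 3), (2, 3, -2))       # 1*2 + 2*3 + 3*(-2)
A_inv = inverse_2x2([[1, 2], [2, 1]])    # determinant is 1*1 - 2*2 = -3

print(dot)
print(A_inv)
```

The inner product evaluates to 2, and the inverse has -1/3 on the diagonal and 2/3 off the diagonal, which agrees with options (c) and (b).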

Pop-up quiz: answers
- Expectation of a random variable drawn from $N(0, 1)$: (a) 0
- Probability of a head on the 10th toss of a biased coin with $P(\text{head}) = 0.2$, after 9 tails: (c) 0.2, because the tosses are independent
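
A short Python check of the two probability answers, offered as a sketch rather than a proof. The conditional probability is computed exactly under the assumption that tosses are independent; the Gaussian mean is only estimated by simulation, with a seed and sample size chosen arbitrarily for this example.

```python
import random
from fractions import Fraction

# Coin question: P(head on toss 10 | 9 tails so far), assuming independent tosses.
p_head = Fraction(1, 5)
p_tail = 1 - p_head
p_nine_tails = p_tail ** 9
conditional = (p_nine_tails * p_head) / p_nine_tails  # the 0.8^9 factor cancels

# Gaussian question: the sample mean of N(0, 1) draws should be close to 0.
rng = random.Random(0)  # arbitrary fixed seed for repeatability
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(conditional)
print(round(sample_mean, 3))
```

Option (d), $0.2 \times 0.8^{9}$, is the probability of the whole sequence (nine tails followed by a head), not the probability of the last toss given what has already happened.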

Related Areas
[Diagram: Information Retrieval and its related areas, grouped under Applications, Mathematics, Algorithms, and Systems. Labels: Web Applications, Bioinformatics…; Machine Learning; Pattern Recognition; Library & Info Science; Natural Language Processing; Statistics; Optimization; Databases; Data Mining; Software engineering; Computer systems]

IR vs. DBs
- Information Retrieval:
  - Unstructured data
  - Semantics of objects are subjective
  - Simple keyword queries
  - Relevance-driven retrieval
  - Effectiveness is the primary issue, though efficiency is also important
- Database Systems:
  - Structured data
  - Semantics of each object are well defined
  - Structured query languages (e.g., SQL)
  - Exact retrieval
  - Emphasis on efficiency
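
The contrast between exact retrieval and relevance-driven retrieval can be illustrated with a toy Python sketch. The rows are modeled on the slides' Table 1 (reading the second job title as "Staff" is an assumption), and both lookup functions are hypothetical simplifications: the first mimics a database predicate, the second ranks by simple word overlap.

```python
# Rows modeled on "Table 1: People in CS Department" (job titles assumed).
staff = [
    {"id": 1, "name": "Jack", "job": "Professor"},
    {"id": 3, "name": "David", "job": "Staff"},
    {"id": 5, "name": "Tony", "job": "IT support"},
]

def exact_lookup(rows, job):
    """Database-style: a row either satisfies the predicate or it does not."""
    return [row["name"] for row in rows if row["job"] == job]

def ranked_lookup(rows, query):
    """IR-style: order rows by how many query words their job field shares."""
    query_terms = set(query.lower().split())
    scored = []
    for row in rows:
        overlap = len(query_terms & set(row["job"].lower().split()))
        if overlap:
            scored.append((overlap, row["name"]))
    # Highest overlap first; ties broken alphabetically by name.
    return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]

exact = exact_lookup(staff, "support staff")        # no row has exactly this job
ranked = ranked_lookup(staff, "IT support staff")   # partial matches, best first

print(exact)
print(ranked)
```

The exact lookup returns nothing because no stored value equals the query string, while the ranked lookup still returns the partially matching people in order of overlap.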

IR and DBs are getting closer
- IR => DBs
  - Approximate search is available in DBs, e.g., in MySQL:
      mysql> SELECT * FROM articles
          -> WHERE MATCH (title,body) AGAINST ('database');
- DBs => IR
  - Use information extraction to convert unstructured data to structured data
  - Semi-structured representation: XML data; queries with structured information

IR vs. NLP
- Information retrieval
  - Computational approaches
  - Statistical (shallow) understanding of language
  - Handles large-scale problems
- Natural language processing
  - Cognitive, symbolic, and computational approaches
  - Semantic (deep) understanding of language
  - (Oftentimes) small-scale problems

IR and NLP are getting closer
- IR => NLP
  - Larger data collections
  - Scalable/robust NLP techniques, e.g., translation models
- NLP => IR
  - Deep analysis of text documents and queries
  - Information extraction for structured IR tasks

Textbooks
- Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze, Cambridge University Press, 2007.
- Search Engines: Information Retrieval in Practice. Bruce Croft, Donald Metzler, and Trevor Strohman, Pearson Education, 2009.
- Modern Information Retrieval. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 2011.
- Information Retrieval: Implementing and Evaluating Search Engines. Stefan Buttcher, Charlie Clarke, Gordon Cormack, MIT Press, 2010.

What to read?
[Diagram: the related-areas map annotated with major venues]
- Information Retrieval: SIGIR, WWW, WSDM, CIKM
- Machine Learning, Pattern Recognition: ICML, NIPS, UAI
- NLP: ACL, EMNLP, COLING
- Databases: SIGMOD, VLDB, ICDE
- Data Mining: KDD, ICDM, SDM
Find more resources on the course website.

IR in future
- Mobile search
  - Desktop search + location? Not exactly!!
- Interactive retrieval
  - Machine collaborates with human for information access
- Personal assistant
  - Proactive information retrieval
  - Knowledge navigator
- And many more
  - You name it!

What you should know
- IR originates from library science for handling unstructured data
- IR has many important application areas, e.g., web search, recommendation, and question answering
- IR is a highly interdisciplinary area, overlapping with DBs, NLP, ML, and HCI

Today’s reading
- Bush, Vannevar. "As We May Think." The Atlantic Monthly 176, no. 1 (1945): 101-108.
- Introduction to Information Retrieval, Chapter 1: Boolean Retrieval