Information Retrieval and Web Search

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Modern Information Retrieval Chapter 1: Introduction
Text mining Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Information Retrieval in Practice
Search Engines and Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
Intelligent Information Retrieval CS 336 Lisa Ballesteros Spring 2006.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Web Mining Research: A Survey
1 Information Retrieval and Web Search Introduction.
1 Web Information Retrieval Web Science Course. 2.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
 IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find.
Information Retrieval in Practice
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
IR Lecture 1. Information Retrieval  Information retrieval is concerned with representing, searching, and manipulating large collections of electronic.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Web Technologies Search Engines
Search Engines and Information Retrieval Chapter 1.
Information Retrieval, Search, and Mining
고급정보검색론 Advanced Information Retrieval
1 Information Retrieval, Search, and Mining Introduction.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
CSCE 5300 Information Retrieval and Web Search Introduction to IR models and methods Instructor: Rada Mihalcea Class web page:
Search Engine Architecture
IT-522: Web Databases And Information Retrieval By Dr. Syed Noman Hasany.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Information Retrieval CSE 8337 Spring 2007 Introduction/Overview Some Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Recuperação de Informação Cap. 01: Introdução 21 de Fevereiro de 1999 Berthier Ribeiro-Neto.
Information Retrieval
Information Retrieval Transfer Cycle Dania Bilal IS 530 Fall 2007.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught.
ITEC547 Text Mining Web Technologies Search Engines.
I NFORMATION R ETRIEVAL AND W EB S EARCH Jianping Fan Department of Computer Science UNC-Charlotte 1.
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Integrated Departmental Information Service IDIS provides integration in three aspects Integrate relational querying and text retrieval Integrate search.
CENG 776 Information Retrieval Nihan Kesim Çiçekli URL: 1/60.
Information Retrieval and Web Search Vasile Rus, PhD websearch/
Information Retrieval in Practice
Sampath Jayarathna Cal Poly Pomona
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Search Engine Architecture
Information Retrieval (in Practice)
Modern Information Retrieval
Information Retrieval and Web Search
Search Engine Architecture
Information Retrieval and Web Search
What is IR? In the 70’s and 80’s, much of the research focused on document retrieval In 90’s TREC reinforced the view that IR = document retrieval Document.
Information Retrieval and Web Search
Information Retrieval
Ying shen Sse, tongji university Jan. 2018
Text Categorization Assigning documents to a fixed set of categories
Data Mining Chapter 6 Search Engines
CSE 635 Multimedia Information Retrieval
Introduction to Information Retrieval
Lecture 8 Information Retrieval Introduction
Chapter 5: Information Retrieval and Web Search
Search Engine Architecture
Information Retrieval and Extraction
Web Mining Research: A Survey
Information Retrieval and Web Design
Recuperação de Informação
Presentation transcript:

Information Retrieval and Web Search Introduction

Information Retrieval (IR) The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the “killer app.” Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

Typical IR Task Given: Find: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

IR System Document corpus IR Query String System Ranked Documents .

Relevance Relevance is a subjective judgment and may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords May not retrieve relevant documents that include synonymous terms. “restaurant” vs. “café” “PRC” vs. “China” May retrieve irrelevant documents that include ambiguous terms. “bat” (baseball vs. mammal) “Apple” (company vs. fruit) “bit” (unit of data vs. act of eating)

Beyond Keywords We will cover the basics of keyword-based IR, but… We will focus on extensions and recent developments that go beyond keywords. We will cover the basics of building an efficient IR system, but… We will focus on basic capabilities and algorithms rather than systems issues that allow scaling to industrial size databases.

Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

IR System Architecture User Interface Text User Need Text Operations Logical View User Feedback Query Operations Indexing Database Manager Inverted file Searching Query Index Text Database Ranked Docs Retrieved Docs Ranking

IR System Components Text Operations forms index words (tokens). Stopword removal Stemming Indexing constructs an inverted index of word to document pointers. Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric.

IR System Components (continued) User Interface manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results. Query Operations transform the query to improve retrieval: Query expansion using a thesaurus. Query transformation using relevance feedback.

Web Search Application of IR to HTML documents on the World Wide Web. Differences: Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

Web Search System Web Spider Document corpus IR Query String System Ranked Documents 1. Page1 2. Page2 3. Page3 .

Other IR-Related Tasks Automated document categorization Information filtering (spam filtering) Information routing Automated document clustering Recommending information or products Information extraction Information integration Question answering

History of IR 1960-70’s: Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area.

IR History Continued 1980’s: Large document database systems, many run by companies: Lexis-Nexis Dialog MEDLINE

IR History Continued 1990’s: Searching FTPable documents on the Internet Archie WAIS Searching the World Wide Web Lycos Yahoo Altavista

IR History Continued 1990’s continued: Organized Competitions NIST TREC Recommender Systems Ringo Amazon NetPerceptions Automated Text Categorization & Clustering

IR History Continued 2000’s Link analysis for Web Search Google Automated Information Extraction Parallel Processing Map/Reduce Question Answering TREC Q/A track

IR History Continued 2000’s continued: Multimedia IR Cross-Language IR Image Video Audio and music Cross-Language IR DARPA Tides Document Summarization Learning to Rank

Recent IR History 2010’s Intelligent Personal Assistants Siri Cortana Google Now Alexa Complex Question Answering IBM Watson Distributional Semantics Deep Learning

Related Areas Database Management Library and Information Science Artificial Intelligence Natural Language Processing Machine Learning

Database Management Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of well-defined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.

Library and Information Science Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). Concerned with effective categorization of human knowledge. Concerned with citation analysis and bibliometrics (structure of information). Recent work on digital libraries brings it closer to CS & IR.

Artificial Intelligence Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries: First-order Predicate Logic Bayesian Networks Recent work on web ontologies and intelligent information agents brings it closer to IR.

Natural Language Processing Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

Natural Language Processing: IR Directions Methods for determining the sense of an ambiguous word based on context (word sense disambiguation). Methods for identifying specific pieces of information in a document (information extraction). Methods for answering specific NL questions from document corpora or structured data like FreeBase or Google’s Knowledge Graph.

Machine Learning Focused on the development of computational systems that improve their performance with experience. Automated classification of examples based on learning concepts from labeled training examples (supervised learning). Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

Machine Learning: IR Directions Text Categorization Automatic hierarchical classification (Yahoo). Adaptive filtering/routing/recommending. Automated spam filtering. Text Clustering Clustering of IR query results. Automatic formation of hierarchies (Yahoo). Learning for Information Extraction Text Mining Learning to Rank