Information Retrieval in Practice

Slides:



Advertisements
Similar presentations
Applications of one-class classification
Advertisements

A Human-Centered Computing Framework to Enable Personalized News Video Recommendation (Oh Jun-hyuk)
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
XML: Extensible Markup Language
Beyond Bags of Words: A Markov Random Field Model for Information Retrieval Don Metzler.
DL:Lesson 11 Multimedia Search Luca Dini
Information Retrieval in Practice
1 Texmex – November 15 th, 2005 Strategy for the future Global goal “Understand” (= structure…) TV and other MM documents Prepare these documents for applications.
Information Retrieval in Practice
Search Engines and Information Retrieval
Chapter 11 Beyond Bag of Words. Question Answering n Providing answers instead of ranked lists of documents n Older QA systems generated answers n Current.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
ADVISE: Advanced Digital Video Information Segmentation Engine
A Markov Random Field Model for Term Dependencies Donald Metzler and W. Bruce Croft University of Massachusetts, Amherst Center for Intelligent Information.
Supervised by Prof. LYU, Rung Tsong Michael Department of Computer Science & Engineering The Chinese University of Hong Kong Prepared by: Chan Pik Wah,
Multimedia Search and Retrieval Presented by: Reza Aghaee For Multimedia Course(CMPT820) Simon Fraser University March.2005 Shih-Fu Chang, Qian Huang,
Information Retrieval in Practice
LYU 0102 : XML for Interoperable Digital Video Library Recent years, rapid increase in the usage of multimedia information, Recent years, rapid increase.
Visual Information Retrieval Chapter 1 Introduction Alberto Del Bimbo Dipartimento di Sistemi e Informatica Universita di Firenze Firenze, Italy.
1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central.
Interfaces for Querying Collections. Information Retrieval Activities Selecting a collection –Lists, overviews, wizards, automatic selection Submitting.
ICS (072)Database Systems Background Review 1 Database Systems Background Review Dr. Muhammad Shafique.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Overview of Search Engines
Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields Yong-Joong Kim Dept. of Computer Science Yonsei.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
Search Engines and Information Retrieval Chapter 1.
Multimedia Databases (MMDB)
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.
A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Finding Better Answers in Video Using Pseudo Relevance Feedback Informedia Project Carnegie Mellon University Carnegie Mellon Question Answering from Errorful.
Querying Structured Text in an XML Database By Xuemei Luo.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Effective Query Formulation with Multiple Information Sources
Understanding The Semantics of Media Chapter 8 Camilo A. Celis.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.
1 Opinion Retrieval from Blogs Wei Zhang, Clement Yu, and Weiyi Meng (2007 CIKM)
Algorithmic Detection of Semantic Similarity WWW 2005.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
Data and Applications Security Developments and Directions Dr. Bhavani Thuraisingham The University of Texas at Dallas Lecture #15 Secure Multimedia Data.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Soon Joo Hyun Database Systems Research and Development Lab. US-KOREA Joint Workshop on Digital Library t Introduction ICU Information and Communication.
Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework N 工科所 錢雅馨 2011/01/16 Li-Jia Li, Richard.
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
MULTIMEDIA DATA MODELS AND AUTHORING
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Mustafa Gokce Baydogan, George Runger and Eugene Tuv INFORMS Annual Meeting 2011, Charlotte A Bag-of-Features Framework for Time Series Classification.
Information Retrieval in Practice
Information Retrieval in Practice
Visual Information Retrieval
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Search Engine Architecture
Data and Applications Security Developments and Directions
Information Retrieval and Web Search
Nonparametric Semantic Segmentation
Information Retrieval and Web Search
What is IR? In the 70’s and 80’s, much of the research focused on document retrieval In 90’s TREC reinforced the view that IR = document retrieval Document.
A Markov Random Field Model for Term Dependencies
Multimedia Information Retrieval
Information Retrieval and Web Design
Presentation transcript:

Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008

Beyond Bag of Words “Bag of Words” Extending representation a document is considered to be an unordered collection of words with no relationships Extending representation feature-based models dependency models document structure question structure other media

Feature-Based Retrieval Models Linear feature-based model Some models support non-linear functions, but linear is more common

Linear Feature-Based Models To find best values for parameters need a set of training data T an evaluation function where RΛ is the set of rankings produced by the scoring function for all the queries Goal of a linear feature-based retrieval model is to find a parameter setting that maximizes E for the training data

Term Dependence Models Term dependence models do not assume that words occur independently of each other e.g., Markov Random Field (MRF) model construct a graph that consists of a document node and one node per query term undirected graphical model models the joint distribution over the document random variable and query term random variables models dependencies between random variables by drawing an edge between them

MRF Model Assumptions independence sequential dependence full dependence general dependence

MRF Model Define a set of potential functions over the cliques of the graph these are the features of the linear feature-based model e.g. sequential dependence model in Galago for query “president abraham lincoln”

Latent Concept Expansion Generalized version of pseudo-relevance feedback and relevance models Latent, or hidden, concepts are words or phrases that users have in mind but do not mention explicitly when they express a query latent concept expansion graph shows dependencies between query words and expansion words better probability estimates for expansion terms expansion features not just terms

Pseudo-Relevance Feedback Graphs Relevance model LCE model

LCE Example

Integrating Databases and IR Possible approaches Extending a database model to more effectively deal with probabilities Extending an information retrieval model to handle more complex structures and multiple relations Developing a unified model and system Applications such as web search, e-commerce, and data mining provide testbeds

Interaction of Search and Databases e.g., e-commerce applications such as Amazon

XML Retrieval XML is an important standard for both exchanging data between applications and encoding documents Database community has defined languages for describing the structure of XML data (XML Schema), and querying and manipulating that data (XQuery and XPath) query languages similar to SQL but must handle hierarchical structure XPath restricted to single document type

XML Retrieval INEX project studies XML retrieval models and techniques similar evaluation approach to TREC queries are specified using a simplified version of XPath called NEXI NEXI constructs include paths and path filters A path is a specification of an element (or node) in the XML tree structure A path filter restricts the results to those that satisfy textual or numerical constraints

NEXI Examples

INEX Examples

Entity Search Identify entities in text Construct “pseudo-documents” to represent entities based on words occurring near the entity over the whole corpus also called “context vectors” Retrieve ranked lists of entities instead of documents

Entity Search Example (organization search based on a TREC news corpus)

Expert Search Find “experts” for a given topic recent TREC track Rank candidate entities e by the joint distribution P(e, q) of entities and query terms P(q|e,d) involves ranking entities in those documents with respect to a query P(e|d) component corresponds to finding documents that provide information about an entity

Expert Search Assuming words and entities are independent leads to poor performance Instead estimate the strength of association between e and q using proximity of co-occurrence of the query words and the entities

Question Answering Providing answers instead of ranked lists of documents Older QA systems generated answers Current QA systems extract answers from large corpora such as the Web Fact-based QA limits range of questions to those with simple, short answers e.g., who, where, when questions

QA Architecture

Fact-Based QA Questions are classified by type of answer expected most categories correspond to named entities Category is used to identify potential answer passages Additional natural language processing and semantic inference used to rank passages and identify answer

Other Media Many other types of information are important for search applications e.g., scanned documents, speech, music, images, video Typically there is no associated text although user tagging is important in some applications Retrieval algorithms can be specified based on any content-related features that can be extracted

Noisy Text OCR and speech recognition produce noisy text i.e., text with numerous errors relative to the original printed text or speech transcript With good retrieval model, effectiveness of search is not significantly affected by noise due to redundancy of text problems with short texts

OCR Examples

Speech Example

Images and Video Feature extraction more difficult Features are low-level and not as clearly associated with the semantics of the image as a text description Typical features are related to color, texture, and shape e.g., color histogram “quantize” color values to define “bins” in a histogram for each pixel in the image, the bin corresponding to the color value for that pixel is incremented by one images can be ranked relative to a query image

Color Histogram Example peak in yellow

Texture and Shape Texture is spatial arrangement of gray levels in the image Shape features describe the form of object boundaries and edges Examples: shape

Video Video is segmented into shots or scenes continuous sequence of visually coherent frames boundaries detected by visual discontinuities Video represented by key frame images e.g., first frame in a shot

Image Annotation Given training data, can learn a joint probability model for words and image features Enables automatic text annotation of images current techniques are moderately effective errors

Music Music is even less associated with words than images Many different representations e.g., audio, MIDI, score Search based on features such as spectrogram peaks, note sequences, relative pitch, etc.