INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur
Information Retrieval
Problem definition: given a user’s information need, find documents satisfying that need.
- “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.).
- Types of information: text, images/graphics, speech, video, etc. Text is still the most commonly used.
Information Retrieval
- Information Retrieval is a research-driven theoretical and experimental discipline.
- The focus is on different aspects of the information-seeking process:
  - Computer scientist: fast and accurate search engine
  - Librarian: organization and indexing of information
  - Cognitive scientist: the process in the searcher’s mind
  - …
- Progress is influenced by advances in Computational Linguistics, Information Visualization, Cognitive Psychology, HCI, …
Information Retrieval
Basic principle:
- Document -> list of keywords / content-descriptors / terms
- User’s information need -> (natural-language) query -> list of keywords
- Measure the overlap between query and documents.
Stages of IR
[Figure: Creation -> Indexing, organizing -> Indexed and structured information -> Information Retrieval (Searching, Browsing)]
IR process
[Figure: Real world -> Collection of documents -> Document representations; Anomalous state of knowledge -> Information need -> Query; Document representations + Query -> Matching -> Results]
Document Representation: Indexing Inverted index
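An inverted index maps each term to the list of documents that contain it (its posting list). The following is a minimal sketch, assuming whitespace tokenization and a small made-up collection; the function name `build_inverted_index` and the sample documents are illustrative only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lower-casing and splitting on whitespace is enough here.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document collection.
docs = {
    1: "information retrieval finds documents",
    2: "documents hold information",
    3: "search engines rank documents",
}
index = build_inverted_index(docs)
print(index["documents"])    # posting list for "documents"
print(index["information"])  # posting list for "information"
```

Keeping posting lists sorted by document id is a design choice that makes the list merging used by Boolean retrieval straightforward.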
Vocabulary
Vocabulary (indexing language): the set of concepts (terms or phrases) that can be used to index documents in a collection.
- Controlled
  - Specific to specialized domains
  - Potential for increased consistency of indexing and precision of retrieval
- Uncontrolled (free)
  - Potentially all the terms in the documents
  - Potential for increased recall
Indexing
- Tokenize: identify individual words.
- Stopword removal: eliminate common words, e.g. and, of, the.
- Stemming: reduce words to a common root, e.g. analysis, analyze, analyzing -> analy; use standard algorithms (Porter).
- Thesaurus: find synonyms for words in the document.
- Phrases: find multi-word terms, e.g. computer science, data mining; use syntactic/linguistic methods or “statistical” methods.
- Named entities: identify names of people, organizations and places; dates; monetary or other amounts, etc.
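The first three indexing steps can be sketched as a small pipeline. This is only an illustration: the stopword list is a tiny made-up sample, and `crude_stem` is a deliberately simple suffix stripper, not the Porter algorithm named on the slide. All function names are hypothetical.

```python
import re

STOPWORDS = {"and", "of", "the", "a", "an", "in", "is", "to"}  # tiny illustrative list
SUFFIXES = ("ing", "es", "ed", "s")  # crude rule set, NOT the Porter algorithm

def tokenize(text):
    """Split text into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]

terms = index_terms("The indexing of documents and the ranking of results")
print(terms)  # ['index', 'document', 'rank', 'result']
```

A production system would replace `crude_stem` with a tested stemmer, since naive suffix stripping conflates or mangles many words.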
Boolean Retrieval Model
- Keywords combined using AND, OR, (AND) NOT
  - e.g. (medicine OR treatment) AND (hypertension OR “high blood pressure”)
- Efficient and easy to implement (list merging)
  - AND: intersection
  - OR: union
- Drawbacks
  - OR: one match is as good as many
  - AND: one miss is as bad as all
  - no ranking
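The list merging behind AND and OR can be sketched with two-pointer merges over sorted posting lists. The posting lists below are invented for illustration, and treating the phrase “high blood pressure” as a single indexed term is an assumption of this sketch.

```python
def intersect(p1, p2):
    """AND: merge two sorted posting lists, keeping ids present in both."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: merge two sorted posting lists, keeping ids present in either."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    return out + p1[i:] + p2[j:]

# Hypothetical posting lists (sorted document ids).
postings = {
    "medicine": [1, 4, 7],
    "treatment": [2, 4, 9],
    "hypertension": [2, 7, 8],
    "high blood pressure": [4, 5],
}

# (medicine OR treatment) AND (hypertension OR "high blood pressure")
left = union(postings["medicine"], postings["treatment"])
right = union(postings["hypertension"], postings["high blood pressure"])
result = intersect(left, right)
print(result)  # [2, 4, 7]
```

The result is an unordered set of matching documents, which illustrates the “no ranking” drawback: documents 2, 4 and 7 are all equally good answers under this model.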
Term Weighting
- Any text item (“document”) is represented as a list of terms and associated weights.
- Term = keyword or content-descriptor
- Weight = a measure of the importance of a term in representing the information contained in the document
Vector Space Model
- Term frequency (tf): repeated words are strongly related to content.
- Inverse document frequency (idf): an uncommon term is more important.
- Normalization by document length:
  - Long documents contain many distinct words.
  - Long documents contain the same word many times.
  - Term weights for long documents should therefore be reduced.
  - Use number of bytes, number of distinct words, Euclidean length, etc.
- Weight = tf x idf / normalization
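The weighting formula can be sketched as follows. The slide leaves the exact variants open, so this sketch assumes raw term counts for tf, idf = log(N / df), and Euclidean length for normalization; other choices are equally valid. The function name and sample documents are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    weight = tf * idf / Euclidean length, with idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / length if length else 0.0) for t, w in raw.items()})
    return vectors

# Hypothetical pre-tokenized documents.
docs = [
    ["information", "retrieval", "retrieval"],
    ["information", "need"],
    ["information", "seeking", "process"],
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(vec)
```

In this toy collection “information” occurs in every document, so its idf is log(1) = 0 and it receives zero weight everywhere, which matches the intuition that a common term is less important.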
Retrieval
- Measure vocabulary overlap between the user query and documents.
- Use the inverted index.
- Score by the cosine of the angle between document and query vectors.
- Ranked retrieval.
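Cosine-based ranking can be sketched as below. To keep the example self-contained it uses plain term counts as vector weights, which is a simplification; weighted vectors would be scored the same way. The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return ids of documents with non-zero similarity, best match first."""
    q = Counter(query.lower().split())
    scored = [
        (cosine(q, Counter(text.lower().split())), doc_id)
        for doc_id, text in docs.items()
    ]
    # Sort by descending score; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

# Hypothetical collection.
docs = {
    "d1": "road accidents on highways",
    "d2": "road safety and road accidents",
    "d3": "stock market news",
}
ranking = rank("road accidents", docs)
print(ranking)  # ['d2', 'd1']
```

This sketch scores every document for clarity; a real engine would use the inverted index to score only documents that share at least one term with the query.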
Query Expansion
- Searching depends on matching keywords between the user query and documents.
- Nature of language -> searchers and document creators may use different keywords to denote the same “concept”.
- Example: fatalities in road accidents on G.T. Road
- Vocabulary mismatch -> poor retrieval quality.
- The problem is aggravated by short queries and large, heterogeneous databases.
- Solution: expand the query by adding related words/phrases.
- Issues:
  - selecting which terms to add to the query
  - calculating weights for the added terms
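Both issues (which terms to add, and with what weight) show up even in a very simple thesaurus-based sketch. The `RELATED` table and the weight of 0.5 for added terms are assumptions made purely for illustration; real systems derive related terms and weights from curated thesauri, co-occurrence statistics, or feedback.

```python
# Hypothetical hand-made thesaurus; not taken from any real resource.
RELATED = {
    "fatalities": ["deaths", "casualties"],
    "accidents": ["crashes", "collisions"],
}

def expand_query(terms, related=RELATED, added_weight=0.5):
    """Return {term: weight}: original terms get 1.0, added terms a lower weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for r in related.get(t, []):
            # setdefault keeps the weight of a term that is already in the query.
            weights.setdefault(r, added_weight)
    return weights

expanded = expand_query(["fatalities", "road", "accidents"])
print(expanded)
```

Giving added terms a lower weight than the user's own terms is one common way to limit the damage when an added term turns out to be only loosely related.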
Relevance Feedback
- The original query is used to retrieve some number of documents.
- The user examines some of the retrieved documents and provides feedback about which documents are relevant and which are non-relevant.
- The system uses the feedback to “learn” a better query:
  - select/emphasize words that occur more frequently in relevant documents than in non-relevant documents;
  - eliminate/de-emphasize words that occur more frequently in non-relevant than in relevant documents.
- The resulting query should bring in more relevant documents and fewer non-relevant documents.
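One classical way to realize this idea is Rocchio's formula, which the slides do not name; it is used here only as an example of the general scheme. The new query is alpha times the old query, plus beta times the mean relevant vector, minus gamma times the mean non-relevant vector. The parameter values and the toy vectors below are assumptions.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update over sparse {term: weight} vectors.

    Terms whose updated weight is not positive are dropped from the query.
    """
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant:
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant)
    for doc in non_relevant:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(non_relevant)
    return {t: w for t, w in new_q.items() if w > 0}

# Hypothetical feedback: the user wants the animal, not the car.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "wildlife": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

Here “cat” and “wildlife” enter the query with positive weight, while “car” would receive a negative weight and is therefore eliminated.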
Link/Citation Analysis
- In uncontrolled environments like the WWW, documents are uncontrolled and untrusted, and there are commercial implications.
- The presence of terms by itself does not signify relevance (spamming).
- The importance of the author also matters.
- Hence link/citation analysis.
PageRank
- Used in the Google search engine.
- A “global” ranking of every web page, calculated from the hyperlink structure of the web (content is ignored).
- Documents with matching keywords are returned in global rank order.
- Principle: highly linked pages are more important than pages with few links. A page has a high rank if the sum of the ranks of its backlinks is high.
- Most effective for underspecified (general) queries.
PageRank
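The principle that a page ranks highly when the ranks of its backlinks sum to a high value can be sketched with power iteration. This is a simplified illustration, not Google's implementation: the damping factor of 0.85, the fixed iteration count, the handling of pages without out-links, and the four-page graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in a {page: [out-links]} graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical web graph: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C has the most backlinks and ends up with the highest rank; A has only one backlink but inherits a high rank because that backlink comes from C.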
Open Source Search Engines
- Lucene
- Terrier
- Zettair
- …
Lucene is the search engine used by DSpace.
Lucene/Solr Architecture
[Figure: Solr architecture diagram layered on Apache Lucene. Labels recoverable from the figure: request handlers (/select, /spell, /admin); update handlers (XML, CSV, binary; Data Import Handler for SQL/RSS; Extracting Request Handler for PDF/Word via Apache Tika); response writers (XML, binary, JSON); search components (query, spelling, faceting, highlighting, statistics, debug, more like this, clustering, distributed search); update processors (signature, logging, indexing); core features (config, schema, caching, index replication, query parsing, filtering, search, text analysis); Apache Lucene core (IndexReader/Searcher, IndexWriter, analysis).]
Evaluation Background
- The user has an information need.
- The information need is converted into a query.
- Documents are relevant or non-relevant.
- An ideal system retrieves all and only the relevant documents.
Set Based Metrics
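Set-based measures treat the result as an unordered set and compare it with the set of relevant documents. The usual ones are precision (the fraction of retrieved documents that are relevant), recall (the fraction of relevant documents that are retrieved), and their harmonic mean, F1. A minimal sketch, with hypothetical document ids:

```python
def set_metrics(retrieved, relevant):
    """Return (precision, recall, F1) for an unordered result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical run: four documents retrieved, three documents relevant.
precision, recall, f1 = set_metrics(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall, f1)
```

In this example two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). The ideal system that retrieves all and only the relevant documents scores 1.0 on both.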
Evaluation Forums TREC, CLEF, NTCIR
References
- Introduction to Information Retrieval. Manning, Raghavan, Schütze.
- Lucene in Action. Manning Publications.